ABSTRACT


Empirical  evidence  suggests  unusual  or  outlying  observations  in  data  sets  are 
much  more  prevalent  than  one  might  expect;  5  to  10%  on  average  for  many  industries. 
This  research  addresses  multiple  outliers  in  the  linear  regression  model.  Although 
reliable  for  a  single  or  a  few  outliers,  standard  diagnostic  techniques  from  an  ordinary 
least  squares  (OLS)  fit  can  fail  to  identify  multiple  outliers.  The  parameter  estimates, 
diagnostic  quantities  and  model  inferences  from  the  contaminated  data  set  can  be 
significantly  different  from  those  obtained  with  the  clean  data.  The  researcher  requires  a 
dependable  method  to  identify  and  accommodate  these  multiple  outliers. 

This  research  tests  both  direct  methods  from  algorithms  and  indirect  methods 
from  robust  regression  estimators  to  identify  multiple  outliers.  A  comprehensive  Monte 
Carlo  simulation  study  evaluates  the  impact  that  outlier  density  and  geometry,  regressor 
variable  dimension,  and  outlying  distance  have  on  numerous  published  methods.  The 
performance  study  focuses  on  outlier  configurations  likely  to  be  encountered  in  practice 
and uses a designed experiment approach. The results for each scenario provide insight
into the performance and limitations of each technique. Recommendations are given for each
technique.

OLS is the optimal regression estimator under a set of assumptions on the
distribution of the error term and predictor variables. Compound robust regression
estimators have been proposed as alternatives when some OLS assumptions fail.
Compound estimators can accommodate multiple outliers and limit the influence of the
observations with remote levels of predictor variables. This research proposes a new
compound estimator that is more effective for extreme observations in X-space and in high
dimensions than currently published methods.

This  research  also  addresses  the  variable  selection  problem  for  compound  robust 
regression  estimators.  Estimating  model  prediction  error  with  resampling  methods 
(bootstrap  and  cross-validation)  is  the  most  effective  approach  to  the  variable  selection 
problem  in  OLS.  Current  research  suggests  that  the  best  method  for  variable  selection  is 
to  select  the  model  with  the  minimum  value  of  prediction  error  from  a  modified  bootstrap 
procedure. The modified procedure uses a bootstrap sample size significantly less than
the  original  sample  size.  A  selection  criterion  is  proposed  based  on  a  low  prediction  error 
(not  necessarily  minimum)  with  the  fewest  predictor  variables.  The  proposed  criterion 
often  provides  superior  results  to  the  minimum  prediction  error  criterion  and  does  not 
require  the  modified  bootstrap  procedure  to  achieve  good  results  in  OLS.  Monte  Carlo 
simulation results suggest that the proposed criterion is also effective for compound
estimators  in  contaminated  samples.  This  research  shows  the  viability  of  combining  the 
two  computationally  intense  procedures  of  resampling  methods  and  compound  estimation 
to  achieve  accurate  model  selection  in  the  presence  of  multiple  outliers. 

Chapter 1
Introduction


1.1 Background and Motivation for this Research

The  goal  of  the  field  of  statistics  is  to  transform  raw  data  into  useful  information 
for  decision  making.  By  its  nature,  statistics  is  not  an  exact  science  and  an  approximation 
to an underlying process is often based on a sample of observations from the total
population of interest. A common objective in statistics is to identify an appropriate
transformation from a sample to relate a response (dependent) variable to a set of
independent  variables.  Linear  regression  is  the  customary  method  used  to 
mathematically  model  a  response  variable  as  a  function  of  the  regressor  (independent) 
variables.  Regression  analysis  is  used  in  all  fields  of  engineering,  science,  and 
management.  Proliferation  of  the  method  continues  because  common  software  packages 
include  regression  options. 

The regression model for n observations and k regressor variables can be
described in terms of matrices as $y = X\beta + \varepsilon$, where $y$ is the $n \times 1$ vector of observed
response values and $X$ is the observed $n \times p$ matrix of k regressor variables augmented
with a column of ones (so $p = k + 1$). $\beta$ is an unknown $p \times 1$ vector of regression coefficients and $\varepsilon$ is
the $n \times 1$ vector of error terms. The $\varepsilon$ vector is critical. If it is identically 0, then the
process modeled is deterministic (e.g. F = ma). However, in practice, it is not identically 0
and the relationship between the response and predictor variables is not exact. That is,
given the same set of regressor variables, the response values will not necessarily be the
same. The goal of regression analysis is to find a good estimate of the unknown
regression coefficients $\beta$ from the observed sample.

The usual estimator of $\beta$ comes from the method of ordinary least squares (OLS)
discovered independently by Gauss in 1795 and Legendre in 1805. OLS minimizes the
sum of the squared distances for all points from the actual observation to the regression
surface. The least squares estimator is attractive because of computational simplicity,
availability of software, and statistical optimality properties. From the Gauss-Markov
theorem, least squares is the best linear unbiased estimator (BLUE) when the errors have
mean 0, constant variance, and are uncorrelated. BLUE
means that among all linear unbiased estimators, OLS has the minimum variance. If $\varepsilon$ is
assumed to be normally, independently distributed with mean 0 and variance $\sigma^2 I$, least
squares is the uniformly minimum variance unbiased estimator. Under this assumption,
inference procedures such as hypothesis tests, confidence intervals, and prediction
intervals are powerful. However, if $\varepsilon$ is not normally distributed, then the OLS
parameter estimates and inferences can be flawed.

Violation  of  the  NID  distribution  of  the  error  term  can  occur  when  there  are  one 
or  more  outliers  in  the  data  set.  An  outlier  is  an  observation  that  is  inconsistent  with  the 
remainder  of  the  data  and  it  is  not  unusual  to  see  an  average  of  10%  outliers  in  data  sets 
for  some  processes  (Barnett  and  Lewis,  1994  and  Hampel  et  al.,  1986).  Some  of  the 
sources  of  outliers  are  errors  in  data  entry  or  measurement,  the  inadvertent  inclusion  of  an 
observation  from  another  population  or  a  plausible  event. 

To  illustrate  the  effect  on  least  squares  regression  of  an  outlier,  consider  the  pilot 
plant  data  from  Daniel  and  Wood  (1971)  where  the  extraction  rate  is  thought  to  be  a 
predictor of the acid content measured by titration. The two fits displayed in Figure 1.1
are of the original data and a hypothetical situation where a single transcription error is
made on an extraction rate that changes it from 37 to 370 (Rousseeuw and Leroy, 1987).
The correct least squares fit is $\hat{y} = 35.5 + 0.32x$, where $\hat{y}$ is the expected response value of
y conditional on the level of x. The intercept and slope estimates for the model fit with
the outlier are distinctly different, $\hat{y} = 58.9 + 0.08x$. Unless interest is confined to the
region around the mean level of the extraction rate, the outlier-contaminated model is poor.
Both of the parameter estimates (intercept and slope) have changed too much from the
true underlying relationship to be considered a meaningful description for the majority of
the  data.  The  OLS  estimates  have  broken  down  after  a  single  anomalous  observation. 
Breakdown  is  a  critical  concept  for  this  research.  Huber’s  (1981)  operational  definition  is 
the  smallest  fraction  of  data  contamination  needed  to  cause  an  arbitrarily  large  change  in 
the  parameter  estimates. 
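
The sensitivity of least squares to one bad value can be reproduced on a small scale. The sketch below is illustrative only: it uses synthetic, noise-free data (not the Daniel and Wood pilot plant data), and the helper name `ols_line` is hypothetical. One regressor value is multiplied by ten to mimic a transcription error of the 37 to 370 kind.

```python
def ols_line(x, y):
    """Return (intercept, slope) of the least squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Synthetic data lying exactly on y = 35 + 0.3x (an assumption for illustration).
x_clean = [float(v) for v in range(30, 130, 10)]
y = [35.0 + 0.3 * v for v in x_clean]

# Hypothetical transcription error: the second regressor value becomes 400 instead of 40.
x_bad = list(x_clean)
x_bad[1] = x_clean[1] * 10.0

b0_clean, b1_clean = ols_line(x_clean, y)
b0_bad, b1_bad = ols_line(x_bad, y)

print("clean fit:        intercept", b0_clean, "slope", b1_clean)
print("contaminated fit: intercept", b0_bad, "slope", b1_bad)
```

With the clean data the fit recovers the generating line; after the single corrupted value the slope collapses toward zero and the intercept rises, the same qualitative pattern as in the pilot plant illustration.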

Figure 1.1 clearly indicates that it is of interest to the regression practitioner to
have a set of reliable tools to detect outlying observations. Fortunately, isolating a single
or a few outliers in OLS is relatively easy with routine diagnostics (e.g. Cook’s D,
DFFITS, scaled residuals and residual graphics) supplied by most statistical analysis
software. Multiple outliers in a sample have a similar, if not worse, effect on the least
squares parameter estimates and inference as displayed in Figure 1.1. However, the
standard diagnostic measures often fail to indicate anything unusual about these
observations. Also, these diagnostics can signal that clean observations are outliers. The
former symptom of multiple outliers is known as masking and the latter is termed
swamping. There are several methods proposed in the literature that attack the multiple
outlier identification problem; yet, there is little guidance for the practitioner on which
methods perform well in representative outlier scenarios. Few methods are readily
available in standard statistical packages.

Figure 1.1. Breakdown of the ordinary least squares (OLS) estimator
in the modified pilot plant data (Daniel and Wood, 1971).

If  the  multiple  outliers  are  successfully  identified,  a  decision  has  to  be  made  on 
what  to  do  with  them.  These  aberrant  observations  could  be  left  in  the  analysis.  Figure 
1.1 graphically depicts the consequences of leaving an outlier in the analysis.
Conversely, the outliers could be removed entirely from the analysis. If the outliers are
plausible events, then these observations may be the most important ones in the sample.
Dismissal of these outliers from the analysis could be a missed opportunity to
characterize  the  process  at  certain  operating  conditions.  A  compromise  between 
including  and  deleting  the  outliers  is  to  downweight  their  influence  on  the  regression 
surface.  Robust  regression  estimators  have  been  proposed  as  alternatives  to  OLS  to 
downweight  observations  as  a  function  of  “outlyingness”  in  parameter  estimation. 

There has been a large body of literature in recent years developing the theory and
practice of robust regression estimators. Typically, these estimators require significant
computational resources because of nonlinear solutions or the requirement to search
numerous subsets of the data to satisfy a constrained objective function. Ironically, the
first robust regression estimator pre-dates OLS by nearly a half century. The $L_1$ or least
absolute value estimator (Boscovich, 1757) is particularly well-suited for those heavy-tailed
distributions (e.g. double exponential) that can generate outliers. However, this
and many other robust regression estimators are not able to accommodate multiple
outliers. That is, they are not high-breakdown estimators and fail with only a modest
amount of outliers. The most often used high-breakdown estimators are the Least
Median of Squares (LMS; Rousseeuw, 1983), Least Trimmed Sum of Squares (LTS; Rousseeuw,
1984) and S-estimators (Rousseeuw and Yohai, 1984). The problem with these
estimators is that they can fail if the outliers have extreme values in the regressor
variables (high-leverage points). The ability of a robust estimator to accommodate
high-leverage outliers is called bounded-influence. The LMS, LTS, and S-estimators are also
not efficient estimators. They do not fit data sets particularly well when there are no
outliers present. Recently, a class of robust regression estimators has been proposed that
simultaneously achieves all three properties: high breakdown, bounded influence, and
efficiency (Simpson et al., 1992, Coakley and
Hettmansperger, 1993, and Simpson and Montgomery, 1998). These compound
estimators have the potential not only to identify a wide range of multiple outlier
configurations, but also to accommodate them in a model. Hampel (1997) recommends
such an approach to make the robust regression and regression diagnostic fields
complementary rather than antagonistic.

Therefore,  one  method  to  detect  multiple  outliers  in  regression  is  to  examine  the 
final  weights  (between  0  and  1)  that  the  robust  regression  estimator  assigns  each 
observation.  Observations  with  weights  close  to  0  are  candidates  for  outliers.  Residual 
values  from  a  robust  fit  can  also  be  used  to  identify  the  outliers.  There  are  also  more 
direct  multiple  outlier  detection  procedures  in  the  literature  that  use  specially  designed 
algorithms.  There  is  little  guidance  and  few  empirical  studies  on  which  methods  work 
best. 
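
The idea of reading outlier candidates off the final robust weights can be sketched as follows. This is a minimal illustration under stated assumptions, not one of the compound estimators studied in this research: it fits an ordinary Tukey bisquare M-estimator by iteratively reweighted least squares from an OLS start, so it is neither high-breakdown nor bounded-influence. The function name `bisquare_irls`, the synthetic data, the conventional tuning constant 4.685, and the 0.1 weight threshold are all choices made for the example.

```python
import numpy as np


def bisquare_irls(X, y, c=4.685, n_iter=50, tol=1e-10):
    """Bisquare M-estimate by iteratively reweighted least squares.

    Returns the coefficient vector and the final weights in [0, 1].
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least squares start
    w = np.ones(len(y))
    for _ in range(n_iter):
        r = y - X @ beta
        # Robust scale: median absolute residual, rescaled for normal errors.
        scale = max(np.median(np.abs(r)) / 0.6745, 1e-12)
        u = r / (c * scale)
        w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, w


# Synthetic data: a line with a small deterministic disturbance and two shifted responses.
x = np.arange(20, dtype=float)
y = 2.0 + 3.0 * x + 0.1 * np.sin(x)
y[[5, 12]] += 30.0  # planted response outliers (illustrative)
X = np.column_stack([np.ones_like(x), x])

beta, weights = bisquare_irls(X, y)
suspects = np.flatnonzero(weights < 0.1)  # observations with weight near 0
print("coefficients:", beta)
print("outlier candidates:", suspects)
```

Because the planted outliers here are in the response only and are not at high-leverage positions, a plain M-estimator is enough to drive their weights to zero; the configurations that motivate compound estimators are harder than this one.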

A tacit assumption to this point is that the correct regressor variables are specified
for the model. Most regression modeling requires selection of the subset of regressor
variables from a larger pool thought to be related to the response. Outliers confound the
variable selection process because a variable that truly has no effect on the response may
appear to be significant because it is fitting the outliers. Equally troublesome, the outliers
may mask a significant variable. As an example, consider the modified Gunst and Mason
(1980) data set created in Section 5.6.2 that has n = 40 observations, k = 4 regressor
variables and four outliers. The OLS parameter estimates along with those from the
proposed compound estimator in this research are displayed in Table 1.1. For this data
set, it is known that $\beta' = [2, 0, 0, 4, 8]$. Least squares has fit the outliers with the two inactive
regressor variables $x_1$ and $x_2$. Note that the compound estimator is resistant to the
outliers.


Table  1.1.  Least  squares  and  proposed  compound  estimates  of  the 
regression  parameters  in  the  modified  Gunst  and  Mason  (1980)  data. 


                          $\beta_0$   $\beta_1$   $\beta_2$   $\beta_3$   $\beta_4$
true parameter              2.00        0.00        0.00        4.00        8.00
least squares estimate      1.71        9.85        1.15       -2.62       10.17
compound estimate           2.24       -0.23        0.61        3.35        8.23

Selection  of  the  best  subset  of  variables  is  a  critical  part  of  the  regression  model 
building  process.  Again,  there  are  numerous  methods  and  criteria  available  to  the 
practitioner with many accessible in standard statistical analysis software. Recent
developments have suggested that resampling methods are better suited for the variable
selection  problem  (Breiman,  1995,  Shao,  1996,  and  Davison  and  Hinkley,  1997).  It  is  not 
known  how  resampling  methods  perform  with  multiple  outliers  in  the  data. 

1.2  Statement  of  the  Problem 

This research was introduced by defining the goal of the field of statistics.
Staudte and Sheather (1990) claim that a better description with respect to robust
estimation may be the “battlefield of statistics” because of the controversy surrounding
many of the proposed techniques. Most of the community does agree that there is no
single best robust regression estimator, multiple outlier identification procedure or
variable selection procedure. However, there are widely varying opinions as to the
applicability of certain methods in specific scenarios.

Hettmansperger  (1998)  states  one  of  the  main  reasons  why  robust  estimation  is 
not  used  more  in  statistics  is  the  “curse  of  abundancy”  for  the  techniques.  His  point  is 
that  not  only  are  there  many  different  estimators  and  algorithms  available,  but  also  that 
each  procedure  has  its  own,  often  large,  set  of  parameter  settings  and  tuning  constants. 
Hettmansperger also states that the lack of software for robust procedures is another reason
contributing to the scarcity of robust analysis.

These  two  reasons  present  a  challenging  dichotomy  to  the  regression  user.  On  the 
one  hand,  extra  effort  is  often  required  to  get  the  appropriate  software  to  implement 
existing  robust  procedures.  However,  once  software  capability  is  achieved,  the 
practitioner  is  saturated  with  implementation  options.  Performance  studies  are  needed  in 
finite  samples  to  screen  many  of  the  existing  procedures  and  quantify  where  each  is  best 
suited. 

To this end, a comparison of multiple outlier detection procedures across a
comprehensive set of scenarios is missing in the literature. The ideal outcome of such a
study would be that one procedure is preferred in all scenarios. If this is not the case,
characterization of effective areas of technique performance would be helpful guidance.
The  comparative  evaluation  could  also  suggest  that  some  techniques  could  be  improved 
to  make  them  “robust”  to  more  outlier  scenarios.  A  similar  approach  has  been  taken  by 
Simpson  and  Montgomery  (1998c)  to  propose  a  “robust”  robust  regression  estimator. 

There also is no shortage of options for variable selection in regression. Some
performance studies exist and resampling methods are preferred. No work in the
literature addresses variable selection in the presence of outliers with
compound estimators. This is understandable because compound estimation is highly
computer intensive and resampling methods increase complexity by orders of magnitude.

1.3  Research  Objectives 

There  are  three  primary  objectives  that  address  the  research  problem. 

•  Characterize the performance of the leading multiple outlier detection procedures for
the linear regression model. The goal is a comprehensive evaluation of published
techniques that suggests where the methods are successful and where they fail. A
successful procedure would have a high probability of identifying the outliers, a low
probability of classifying clean observations as outliers and be easily implemented in
analysis software.

•  Select and improve upon the most promising techniques from the comparative study
of multiple outlier detection procedures. This phase could improve upon a direct
multiple outlier detection algorithm, a robust regression estimator, or a combination of
both.

•  Determine the appropriateness and computational feasibility of resampling methods
for variable selection in the presence of outliers. This phase specifically addresses
variable selection with compound robust regression estimators using the bootstrap
and cross-validation procedures.

1.4 Scope of Research

The  research  focus  is  on  the  finite  sample  size  performance  of  the  published 
techniques  and  also  those  proposed  in  this  research.  The  selection  of  the  techniques  is 
limited  to  those  that  are  promising  and  often  referenced  in  the  literature  and  those  that 
perform  well  in  pilot  studies.  For  this  research,  the  problems  of  outlier  identification, 
robust  estimation  and  variable  selection  are  limited  to  the  linear  regression  model. 
Nonparametric,  Bayesian,  and  nonlinear  regression  and  generalized  linear  models  are  not 
considered;  although,  many  of  the  concepts  explored  easily  extend  to  those  classes  of 
models. 

Monte Carlo simulation is the primary tool to accomplish the objectives outlined
in Section 1.3. This computer-implemented technique generates numerous data sets by
randomly varying specific values at each iteration. For comprehensive test and
evaluation of the methods, it is not possible to cover all of the infinitely many factor levels that
characterize a data set such as the number of observations (n), number of regressors (k),
percentage of outliers, outlier location, and magnitude of outliers. Representative and
interesting levels of these factors are selected from pilot studies and sequential analysis.
Additionally, each individual technique has its own set of specific parameter settings.
Either the default or most favorable settings from pilot studies are used. In most cases,
the Monte Carlo simulation experiments are set up as factorial designs to gain the
maximum understanding from a moderate amount of experimentation. Although many
references suggest using thousands of Monte Carlo simulation replicates, the nature of
this problem does not lend itself to such luxury. In all cases, there are enough replicates
to get a clear indication of performance.
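
The structure of such a study can be sketched in a few lines. Everything specific below is an assumption made for illustration: the factor names and levels, the 20 replicates per cell, the data-generating model, and the placeholder detection rule (flag a standardized OLS residual larger than 2.5 in absolute value). None of these is taken from the experiments reported in this research.

```python
import itertools

import numpy as np


def simulate_dataset(rng, n, k, outlier_frac, shift):
    """Generate one contaminated data set; the first n_out responses are shifted."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
    beta = np.ones(k + 1)
    y = X @ beta + rng.normal(size=n)
    n_out = int(round(outlier_frac * n))
    is_outlier = np.zeros(n, dtype=bool)
    is_outlier[:n_out] = True
    y[is_outlier] += shift
    return X, y, is_outlier


def detection_rate(X, y, is_outlier, cutoff=2.5):
    """Fraction of planted outliers flagged by a standardized OLS residual rule."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    n, p = X.shape
    s = np.sqrt(resid @ resid / (n - p))
    flagged = np.abs(resid / s) > cutoff
    return flagged[is_outlier].mean()


# A 2^4 full factorial in four hypothetical factors.
levels = {
    "n": [40, 80],
    "k": [2, 6],
    "outlier_frac": [0.1, 0.2],
    "shift": [5.0, 10.0],
}
rng = np.random.default_rng(0)
results = {}
for combo in itertools.product(*levels.values()):
    n, k, frac, shift = combo
    rates = [
        detection_rate(*simulate_dataset(rng, n, k, frac, shift))
        for _ in range(20)  # replicates per design point
    ]
    results[combo] = float(np.mean(rates))

for combo, rate in results.items():
    print(dict(zip(levels, combo)), "mean detection rate:", round(rate, 3))
```

Arranging the runs as a factorial design means each factor's effect on the detection rate can be estimated from all 16 cells rather than from a one-factor-at-a-time sweep.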

1.5  Summary  and  Outline  of  Research 

The goal of this research is to comprehensively and fairly evaluate the leading
candidate multiple outlier detection procedures. These results are then used to introduce
an improved method. Variable selection for compound estimators is then considered with
resampling methods.

Chapter 2 reviews the relevant literature on what has been published to date.
Chapters 3 through 5 are essentially stand-alone documents that address each of
the research objectives. Chapter 3 details the Monte Carlo simulation performance study
of multiple outlier detection procedures. Chapter 4 proposes a new compound estimator
using results from extensive performance studies on measures of leverage and
high-breakdown estimators. Chapter 5 addresses the variable selection problem in linear and
robust regression using resampling methods. A new variable selection criterion is
introduced that proves effective with both least squares and compound estimators.
Chapter 6 provides a summary of the results, the contributions, and the recommendations
for further research.


Chapter  2 

Literature  Review 


2.1  Introduction 

This chapter reviews the related literature for this research. Chapters 3, 4 and 5
are written as larger versions of what is to be submitted for publication. As such, each
chapter contains its own literature review restating many of the results presented here.
The difference is that the key concepts and algorithms are explained in more detail in
Chapter 2. This review begins with some background material on least squares
regression estimation and diagnostics to identify a single outlier. The discussion expands
to address the multiple outlier problem. A detailed presentation of direct procedures to
identify multiple outliers is followed by a thorough discussion of the indirect
identification procedures, namely robust regression estimators. The chapter concludes
with a discussion of variable selection methods in linear regression.

2.2  Ordinary  Least  Squares  Regression 

Regression analysis models the relationship between a response variable and a set
of predictor variables. The regression model for n observations and k regressor variables
can be described in terms of matrices as $y = X\beta + \varepsilon$, where $y$ is the $n \times 1$ vector of
observed response values and $X$ is the observed $n \times p$ matrix of k regressor variables
augmented with a column of ones, $\beta$ is an unknown $p \times 1$ vector of the regression
coefficients and $\varepsilon$ is the $n \times 1$ vector of error terms. In practice, the fitted regression model is
$y = X\hat{\beta} + e$, where $\hat{y} = X\hat{\beta}$ is the vector of predicted response values, $e$ is the vector of
residuals, and $\hat{\beta}$ is the estimate of the regression coefficients. OLS computes these parameter
estimates, $\hat{\beta}$, by minimizing the sum of the squared residuals. Therefore, the objective is
to find those values of $\hat{\beta}$ that lead to the minimum value of $e'e = \sum_{i=1}^{n} e_i^2$. Nearly all
regression texts (e.g. Montgomery and Peck, 1992) give the fundamental derivation of
the OLS parameter estimates as follows:

$$
\begin{aligned}
S(\beta) = \varepsilon'\varepsilon &= (y - X\beta)'(y - X\beta) \\
&= y'y - \beta'X'y - y'X\beta + \beta'X'X\beta \\
&= y'y - 2\beta'X'y + \beta'X'X\beta
\end{aligned}
$$

using differentiation to minimize $S(\beta)$ with respect to $\beta$,

$$\left.\frac{\partial S}{\partial \beta}\right|_{\hat{\beta}} = -2X'y + 2X'X\hat{\beta} = 0$$

rewriting and simplifying gives the least squares normal equations

$$X'X\hat{\beta} = X'y$$

which can be solved, if $X'X$ is of full rank, for the familiar OLS relationship

$$\hat{\beta} = (X'X)^{-1}X'y$$

The vector of fitted values can be expressed as

$$\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y = Hy$$

The matrix $H$ is known as the hat or projection matrix. The diagonal elements of the hat
matrix  are  used  in  many  least  squares  diagnostics  because  they  provide  an  indication  of 
remoteness  in  X-space. 
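
The normal equations and the hat matrix translate directly into a few lines of linear algebra. The sketch below uses a small made-up data set with noise-free responses so that the least squares solution can be checked against the generating coefficients; the variable names and values are assumptions for illustration only.

```python
import numpy as np

# Illustrative data: n = 6 observations, k = 2 regressors (values are made up).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
X = np.column_stack([np.ones_like(x1), x1, x2])  # column of ones plus k regressors
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true  # no error term, so OLS should recover beta_true

# Solve the normal equations X'X b = X'y (requires X'X to be of full rank).
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Hat matrix H = X (X'X)^{-1} X' and fitted values y_hat = H y.
H = X @ np.linalg.solve(XtX, X.T)
y_hat = H @ y

print("beta_hat:", beta_hat)
print("hat diagonals:", np.diag(H))
print("trace of H:", np.trace(H))
```

Solving the linear system is used here instead of forming an explicit inverse; for ill-conditioned regressor matrices a QR-based routine such as `numpy.linalg.lstsq` is generally the safer choice.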

Some useful properties of $\hat{\beta}$ are that it is an unbiased estimator ($E(\hat{\beta}) = \beta$) and
the Gauss-Markov theorem guarantees that among all linear unbiased estimators of $\beta$, the least
squares estimate has the minimum variance, $\mathrm{Cov}(\hat{\beta}) = \sigma^2 (X'X)^{-1}$. A common estimate
for $\sigma^2$ is the Mean Square Error, $MS_E = e'e/(n-p)$. The least squares estimator of $\beta$ is
also the maximum likelihood estimator under the assumption that the error terms are
independent and identically distributed normal variates with mean 0 and covariance
matrix $\sigma^2 I$. The usual notation for this assumption is $\varepsilon \sim NID(0, \sigma^2 I)$ and if it holds,
then OLS is also the uniformly minimum variance unbiased estimate (UMVUE). Model
inferences such as confidence intervals and hypothesis tests are also very powerful if the
error terms are NID. From a statistical point of view, OLS is the optimal estimator under
a normal error.

The major disadvantage of OLS is its performance when the error term cannot be
assumed to be normally distributed. OLS estimates and tests rapidly lose power with
nonnormal error terms. One of the most common violations of a normal distribution for
the error terms is the presence of one or more outliers in the sample.

2.3  An  Outlier  in  Least  Squares  Regression 

Barnett and Lewis (1994) define an outlier as an observation that appears
inconsistent with the remainder of the data set. Outlier identification is important in OLS
not only because of the impact of outliers on the OLS model, but also because it provides
insight into the
process. These outlying cases may arise from a different distribution altogether from the
bulk of the data. The distribution of the full dataset is contaminated in this instance. In
contaminated datasets, it makes sense to see if there may be an alternative model form
(e.g. lognormal as opposed to normal errors) to fit the true process. Alternatively, the
distribution of the unusual observations may be uncontaminated but there may have been
an external cause such as a recording or interpretation error. These two cases may
require a different approach on how to accommodate the outlier: include it in the model,
downweight it and include it in the model, or throw it away.

To  classify  types  of  outliers  for  this  research,  consider  the  simple  linear  regression 
model  displayed  in  Figure  2.1.  The  ellipse  defines  the  majority  of  the  data.  Point  A  is  an 
outlier  in  Y-space  because  its  response  value  is  significantly  different  from  the  responses 
contained in the ellipse. Point A is also a residual or regression outlier. Its expected
response, conditional on the value of x, differs significantly from the regression line fitted
to the data in the ellipse. Point B is unusual in X-space. Observations that are remote in
X-space  are  high-leverage  points  and  are  also  referred  to  as  exterior  X-space  observations 
in  this  research.  Although  Point  B  is  remote  in  Y-space,  it  is  not  a  residual  outlier 
because  its  response  value  conforms  to  the  regression  line  fit  to  the  observations  in  the 
ellipse.  Point  B  can  be  considered  a  “good  outlier”.  Points  C  and  D  are  high-leverage 
points  and  residual  outliers.  Point  C  is  unusual  in  Y-space;  point  D  is  not. 

Figure 2.1. Outlier configurations: Points A, B, and C are outlying in Y-space (exterior
Y-space), Points B, C, and D are high-leverage points (exterior X-space), Points A, C,
and D are residual outliers because they do not conform to the regression line defined by
the clean observations in the ellipse.


2.3.1 Detection of an Outlier in X-space

The effect of outliers in X or XY-space is to “pull” or exert more influence on the
model parameters estimating the regression line. These observations are influential or
high-leverage points. When there are three or fewer regressor variables, candidate
outlying observations in X-space can be detected by a two- or three-dimensional
scatterplot of the regressor variables. Computational measures are needed for more than
three variables. The diagonals, $h_{ii}$, of the $n \times n$ hat matrix, $H$, provide a measure of
remoteness in X-space. Because the sum of the hat diagonals is p, the average of all the
hat diagonals is $p/n$. Hoaglin and Welsch (1978) suggest observations with $h_{ii}$ greater
than $2p/n$ or $3p/n$ should be considered as potential outliers.
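
A leverage screen of this kind is short to write down. The sketch below is an illustration on made-up data: the helper name `hat_diagonals` is hypothetical, and it applies the $2p/n$ cutoff to a single regressor with one remote value. It obtains the hat diagonals from a thin QR factorization, which avoids forming the full $n \times n$ matrix.

```python
import numpy as np


def hat_diagonals(X):
    """Diagonal of H = X (X'X)^{-1} X', computed as row sums of squares of Q."""
    Q, _ = np.linalg.qr(X)  # reduced QR, so H = Q Q'
    return np.sum(Q**2, axis=1)


# One regressor with a single remote value (illustrative data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
X = np.column_stack([np.ones_like(x), x])

h = hat_diagonals(X)
n, p = X.shape
cutoff = 2 * p / n
flagged = np.flatnonzero(h > cutoff)

print("hat diagonals:", np.round(h, 3))
print("cutoff 2p/n:", cutoff)
print("high-leverage candidates (0-based index):", flagged)
```

Only the observation at x = 30 exceeds the cutoff here. With several remote points clustered together the hat diagonals can understate their remoteness, which is the masking problem discussed in this chapter.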

A related measure for multivariate distance in X-space is the Mahalanobis
distance and is defined for the $i$th observation as:

$$D_i^2 = (x_i - \mu)' \Sigma^{-1} (x_i - \mu)$$

where $x_i$ is the $p \times 1$ vector of regressor variables for case $i$, $\mu$ is the mean vector of X
and $\Sigma$ is the $p \times p$ sample covariance matrix. If the classical estimates of $\mu$ and $\Sigma$ from
the full sample are used, then it can be shown that the hat diagonal is a function of the
Mahalanobis distance, $h_{ii} = \frac{D_i^2}{n-1} + \frac{1}{n}$. Observations with $D_i^2$ greater than $\chi^2_{p,\,1-\alpha/2}$ are
potential outliers (Rousseeuw and van Zomeren, 1990).
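
The link between the hat diagonal and the Mahalanobis distance can be checked numerically. The following Python sketch is illustrative only: the function names `hat_diagonals` and `mahalanobis_sq` and the simulated data are assumptions made here, not part of the cited work. It also assumes that the distance is computed on the regressors without the intercept column, while the hat matrix uses the full design matrix with the intercept.

```python
import numpy as np

def hat_diagonals(X):
    """Diagonal of H = X (X'X)^{-1} X' for a full-rank design matrix X (intercept included)."""
    Q, _ = np.linalg.qr(X)            # thin QR, so H = Q Q'
    return np.sum(Q * Q, axis=1)

def mahalanobis_sq(Z):
    """Squared Mahalanobis distances of the rows of Z using the classical mean and covariance."""
    centered = Z - Z.mean(axis=0)
    cov = np.atleast_2d(np.cov(Z, rowvar=False))   # sample covariance, divisor n - 1
    return np.einsum("ij,jk,ik->i", centered, np.linalg.inv(cov), centered)

# Hypothetical data: 30 cases, 3 regressors, case 0 made remote in X-space.
rng = np.random.default_rng(0)
n, k = 30, 3
Z = rng.normal(size=(n, k))
Z[0] = [6.0, -6.0, 6.0]
X = np.column_stack([np.ones(n), Z])
p = X.shape[1]                        # number of parameters, intercept included

h = hat_diagonals(X)
d2 = mahalanobis_sq(Z)
flagged = np.flatnonzero(h > 2 * p / n)   # the 2p/n leverage cutoff
```

In this sketch the hat diagonals sum to p and satisfy the identity above exactly, so either quantity carries the same information about remoteness when classical estimates are used.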

2.3.2  Detection  of  a  Residual  Outlier 

An observation with a relatively large residual value is a candidate for an outlier.
Scaling each residual can often help detect outliers. Standardized residuals correct for
the overall model variance and are calculated for each observation as $e_i / \sqrt{MS_E}$.
Under the assumption of normally distributed error terms the standardized residuals can
be compared to the percentiles of the standard normal. A possible problem with this
approach is that the variance of the residuals depends on their location in X-space,
$\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$. Behnken and Draper (1972) suggest the constant variance
studentized residual defined as $r_i = e_i / \sqrt{MS_E (1 - h_{ii})}$. The studentized deleted residual
replaces $MS_E$ in the previous equation with the variance estimate obtained by removal of
the $i$th observation, $s_{(i)}^2 = \frac{(n-p) MS_E - e_i^2 / (1 - h_{ii})}{n-p-1}$. Montgomery and Peck (1992)
state that the studentized deleted residual is preferred in dealing with outliers especially
since it follows the t-distribution. Allen (1971) states that the residual obtained by using
a model fitted with a sample that omits the $i$th observation is the prediction error sum of
squares (PRESS) residual. The PRESS residual is easily calculated in least squares as

$$e_{(i)} = \frac{e_i}{1 - h_{ii}}$$

and does not require $n$ separate OLS fits.
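
The scaled residuals above can all be obtained from a single OLS fit. The Python sketch below is a minimal illustration under stated assumptions: the function name `residual_diagnostics`, the dictionary keys, and the simulated data with one shifted response are hypothetical choices made here. A brute-force leave-one-out loop would reproduce the `press` and `s2_deleted` values, which is the point of the closed forms.

```python
import numpy as np

def residual_diagnostics(X, y):
    """Standardized, studentized, studentized deleted, and PRESS residuals from one OLS fit."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ (XtX_inv @ X.T @ y)                      # ordinary residuals
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)          # hat diagonals
    mse = e @ e / (n - p)
    s2_deleted = ((n - p) * mse - e**2 / (1 - h)) / (n - p - 1)
    return {
        "standardized": e / np.sqrt(mse),
        "studentized": e / np.sqrt(mse * (1 - h)),
        "deleted": e / np.sqrt(s2_deleted * (1 - h)),
        "press": e / (1 - h),
        "s2_deleted": s2_deleted,
    }

# Hypothetical data: 20 cases, intercept plus two regressors, case 3 shifted by 10 error standard deviations.
rng = np.random.default_rng(1)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
y[3] += 10.0
diag = residual_diagnostics(X, y)
```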


2.3.3  Influence  Measures  in  Least  Squares  Regression 

The  hat  diagonal  and  residual  measures  are  useful  diagnostic  measures  to  quantify 
an  observation’s  remoteness  in  X-space  and  the  distance  off  the  regression  surface. 
However,  they  do  not  provide  an  indication  of  how  the  model  parameter  estimates  or 
fitted  values  are  impacted  by  inclusion  of  the  potential  outlier.  Influence  diagnostic 
measures  have  been  developed  to  help  in  making  the  decision  of  what  to  do  with  an 
unusual  observation.  That  is,  an  observation  may  be  a  high-leverage  point  and  residual 
outlier,  yet  inclusion  in  the  analysis  has  little  effect  on  model  parameter  estimates  and 
inferences.  Barrett  and  Gray  (1997)  state  most  influence  diagnostics  can  be  decomposed 
into  a  measure  of  leverage  and  a  measure  of  residual. 

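Cook’s Distance (Cook, 1979), $D_i$, incorporates both the remoteness in X-space
and in residual.

$$D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})' X'X (\hat{\beta}_{(i)} - \hat{\beta})}{p\, MS_E} = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$$

where $\hat{\beta}_{(i)}$ is the vector of parameter estimates from the OLS fit without observation $i$ and
$r_i^2$ is the squared studentized residual. Cook recommends that distances greater than
$F_{\alpha=0.5,\,p,\,n-p} \approx 1.0$ be considered influential.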
Belsley, Kuh, and Welsch’s (1980) DFBETAS statistic considers the impact of
leverage and residual on each of the $p$ parameter estimates. The statistic measures how
each parameter estimate changes if the $i$th observation is removed from the data set.
Observations with DFBETAS exceeding $2/\sqrt{n}$ in magnitude are influential.

$$\mathrm{DFBETAS}_{j,i} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\sqrt{s_{(i)}^2 C_{jj}}}$$

where $\hat{\beta}_{j(i)}$ is the OLS estimate of the $j$th regression coefficient from a fit without the
$i$th observation, $s_{(i)}^2$ is the variance estimate from that fit, and $C_{jj}$ is the $j$th diagonal
element of $(X'X)^{-1}$.

Belsley, Kuh and Welsch (1980) introduced DFFITS to measure the influence on
the predicted values by omission of the $i$th observation. Observations exceeding $2\sqrt{p/n}$
in absolute value are considered influential.

$$\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{s_{(i)}^2 h_{ii}}}$$


Belsley, Kuh and Welsch (1980) define the COVRATIO statistic to measure the
overall precision of estimation. This statistic is based on the ratio of generalized
variances found from the determinant of the covariance matrix.

$$\mathrm{COVRATIO}_i = \frac{\left| (X_{(i)}' X_{(i)})^{-1} s_{(i)}^2 \right|}{\left| (X'X)^{-1} MS_E \right|} = \left( \frac{s_{(i)}^2}{MS_E} \right)^p \frac{1}{1 - h_{ii}}$$

Belsley, Kuh, and Welsch suggest that observations varying by more than $3p/n$ from
unity may be influential.
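
All four single-row influence measures can be computed from one OLS fit using the standard deletion identities, without refitting the model n times. The Python sketch below is an illustration under stated assumptions: the function name `influence_measures`, the return order, and the simulated data are hypothetical, and the closed forms used are the usual leave-one-out identities rather than anything specific to the works cited above.

```python
import numpy as np

def influence_measures(X, y):
    """Cook's D, DFFITS, DFBETAS (p x n array), and COVRATIO from a single OLS fit."""
    n, p = X.shape
    C = np.linalg.inv(X.T @ X)
    e = y - X @ (C @ X.T @ y)
    h = np.einsum("ij,jk,ik->i", X, C, X)                 # hat diagonals
    mse = e @ e / (n - p)
    s2_del = ((n - p) * mse - e**2 / (1 - h)) / (n - p - 1)
    r2 = e**2 / (mse * (1 - h))                           # squared studentized residuals
    cooks_d = r2 / p * h / (1 - h)
    dffits = e * np.sqrt(h) / (np.sqrt(s2_del) * (1 - h))
    delta_beta = (C @ X.T) * (e / (1 - h))                # column i holds beta_hat - beta_hat_(i)
    dfbetas = delta_beta / np.sqrt(np.outer(np.diag(C), s2_del))
    covratio = (s2_del / mse) ** p / (1 - h)
    return cooks_d, dffits, dfbetas, covratio

# Hypothetical data used only to exercise the function.
rng = np.random.default_rng(2)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
cooks_d, dffits, dfbetas, covratio = influence_measures(X, y)
p = X.shape[1]
influential = np.flatnonzero(np.abs(dffits) > 2 * np.sqrt(p / n))   # DFFITS cutoff from the text
```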
The leverage, residual, and influence measures described above can effectively
isolate an outlier and provide an indication as to how much it is affecting the model.
Many authors recommend complementing these procedures with a plot of predicted
versus residual values, a normal probability plot of residuals, and plots of each regressor
versus residual values for outlier detection. Cook (1998) gives guidance on numerous
other modern graphical procedures that can provide insight into outliers and influence in
regression. The problem with these quantitative and graphical approaches to the outlier
problem is that they can fail if there are multiple outliers. Kempthorne and Mendel
(1990) discuss the inadequacies of these single row influence diagnostics when applied to
multiple observations.

2.4  Detection  of  Multiple  Outliers  with  Direct  Methods 

The  reason  many  of  the  techniques  in  Sections  2.3.2  and  2.3.3  fail  with  multiple 
outliers  is  that  they  are  a  function  of  the  covariance  matrix.  If  there  are  too  many 
outliers,  then  the  estimate  of  the  covariance  matrix  is  poor  and  biased  toward  the  outliers. 
The  two  primary  symptoms  from  multiple  outliers  in  regression  are  masking  and 
swamping.  Masking  occurs  when  the  true  outliers  are  not  identified.  This  inflates  the 
estimate  of  error  thereby  affecting  the  power  of  test  statistics.  Swamping  occurs  when 
inliers  are  identified  as  outliers.  One  possible  solution  to  the  problem  is  to  analyze 
subsets  of  the  observations  thought  to  be  outliers. 
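
Masking is easy to reproduce on a small constructed data set. The Python sketch below is illustrative only: the function names and the toy data (26 clean cases near the line y = 1 + 2x and four high-leverage cases placed 12 units below it) are assumptions made here, not data from the cited studies. The four planted cases pull the OLS line toward themselves, so their studentized residuals stay unremarkable.

```python
import numpy as np

def toy_data():
    """26 clean cases near y = 1 + 2x plus 4 high-leverage cases shifted 12 units down (hypothetical)."""
    x_clean = np.arange(26) / 5.0
    y_clean = 1.0 + 2.0 * x_clean + 0.05 * np.sin(np.arange(26) ** 2)   # small deterministic noise
    x_out = np.array([10.0, 10.2, 10.4, 10.6])
    y_out = 1.0 + 2.0 * x_out - 12.0
    x = np.concatenate([x_clean, x_out])
    y = np.concatenate([y_clean, y_out])
    return np.column_stack([np.ones(x.size), x]), y

def studentized_residuals(X, y):
    """Internally studentized OLS residuals e_i / sqrt(MSE (1 - h_ii))."""
    n, p = X.shape
    C = np.linalg.inv(X.T @ X)
    e = y - X @ (C @ X.T @ y)
    h = np.einsum("ij,jk,ik->i", X, C, X)
    return e / np.sqrt(e @ e / (n - p) * (1 - h))

X, y = toy_data()
r = studentized_residuals(X, y)
planted = np.arange(26, 30)
largest = int(np.argmax(np.abs(r)))     # index of the most extreme studentized residual
```

On this toy set none of the planted cases has a studentized residual above 2 in absolute value, and the most extreme residual belongs to a clean case, which illustrates both masking and the tendency toward swamping.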

Belsley,  Kuh  and  Welsch  (1980)  and  Cook  and  Weisberg  (1982)  extend  their 
influence diagnostic measures to accommodate subsets of observations rather than just
one. These authors and Sebert (1995) demonstrate that the multiple row diagnostics
effectively assess the joint influences exerted by several outliers. Barrett and Ling (1992)
and Barrett and Gray (1997) propose improved multiple row diagnostics that are based on
measures  of  leverage,  residual  and  the  interaction  between  the  two.  The  problem  with  all 
multiple  row  influence  diagnostics  is  that  the  correct  subset  must  be  tested.  This  presents 
a  significant  combinatorial  problem  with  increasing  sample  size.  Several  procedures  to 
identify  this  outlying  set  have  been  published  in  the  last  25  years.  Hadi  and  Simonoff 
(1993)  classify  the  procedures  as  direct  or  indirect.  Direct  procedures  use  a  specifically 
designed  algorithm  to  detect  multiple  outliers.  The  indirect  methods  use  either  the 
weights  assigned  to  each  observation  or  the  residuals  from  a  fit  with  a  robust  regression 
estimator. 

This  section  chronologically  describes  the  direct  methods  to  detect  multiple 
outliers  in  linear  regression.  For  the  procedures  that  are  used  in  the  performance  studies, 
there  is  a  detailed  outline  of  the  algorithm  that  is  significantly  expanded  from  the  short 
summary  provided  in  chapter  3.  There  is  a  brief  description  of  several  other  procedures 
for  historical  purposes  and  reference. 

2.4.1  Gentleman  and  Wilk  Subsets  Algorithm 

Gentleman  and  Wilk  (1975)  are  generally  credited  with  first  addressing  methods 
to  detect  the  multiple  outliers  in  the  least  squares  regression  model.  Their  Q  statistic  is 
based  on  the  reduction  in  error  sum  of  squares  from  the  model  including  all  observations 
to  that  of  a  model  of  size  (n  -j)  where  j  is  the  pre-specified  maximum  number  of  potential 
outlying cases. This statistic is computed for all subsets of size j and those subsets with
large Q values are considered potential outlier sets. The method’s limitations are the
computational complexities for large n and the requirement to specify the expected
number of outliers. Gentleman (1980) addressed the computational complexity issue by
sequentially selecting the set of outliers based on OLS studentized residuals from the full
sample.  This  procedure  still  suffers  from  masking  and  swamping  because  studentized 
residuals  may  not  appear  unusual  if  there  are  multiple  outliers. 

2.4.2  Hawkins,  Bradu,  and  Kass  Elemental  Sets  Algorithm 

Hawkins,  Bradu  and  Kass  (1984)  identify  multiple  outliers  in  regression  models 
with  elemental  sets.  Numerous  random  samples  of  size  p  are  formed  from  the  original 
data  set  and  fit  with  an  OLS  regression  model.  Outliers  are  the  observations  with  large 
values  for  the  summary  statistics  (e.g.  median)  on  the  set  of  residuals  from  all  of  these 
regressions. This procedure is similar to bootstrapping and suffers from computational
complexity. Also, the OLS residuals may not be unusual for the high-leverage regression
outliers. 
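
The elemental sets idea can be sketched in a few lines of Python. This is a simplified illustration, not the published algorithm: the function name `elemental_set_scores`, the choice of 500 random sets, the use of the median absolute residual as the only summary statistic, and the toy data are all assumptions made here.

```python
import numpy as np

def elemental_set_scores(X, y, n_sets=500, seed=0):
    """Median absolute residual of each case over exact fits to random subsets of size p."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    abs_resid = []
    for _ in range(n_sets):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])   # the fit passes exactly through the p cases
        except np.linalg.LinAlgError:
            continue                                  # skip singular elemental sets
        abs_resid.append(np.abs(y - X @ beta))
    return np.median(np.array(abs_resid), axis=0)

# Hypothetical toy data: 26 clean cases near y = 1 + 2x and 4 high-leverage cases 12 units below the line.
x = np.concatenate([np.arange(26) / 5.0, [10.0, 10.2, 10.4, 10.6]])
y = 1.0 + 2.0 * x
y[:26] += 0.05 * np.sin(np.arange(26) ** 2)
y[26:] -= 12.0
X = np.column_stack([np.ones(x.size), x])

scores = elemental_set_scores(X, y)
```

Because most random subsets contain only clean cases, the planted cases receive large scores and the clean cases small ones, even though the planted cases sit at high leverage. The cost is the large number of fits, which grows quickly with n and p.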

2.4.3  Marasinghe  Backward  Selection  Algorithm 

Marasinghe  (1985)  proposes  a  multi-stage  procedure  that  also  requires 
specification  of  the  expected  maximum  number  of  outliers.  The  outliers  are  sequentially 
removed  based  on  the  largest  absolute  value  of  the  studentized  residual.  The  test  statistic 
$F_j$ is the ratio of error sum of squares from the reduced model with (n - j) observations to
the error sum of squares for the full sample model. If $F_j$ exceeds a critical value from the
Bonferroni inequalities, then the current set is the outlier set; otherwise, the procedure is
repeated using (j - 1) candidate outliers. This methodology again relies on the
studentized residual, which suffers from masking and swamping, particularly if j is
specified  too  large  (Fung,  1988). 

A  proposed  methodology  by  Kianifard  and  Swallow  (1989,  1990  and  described 
for  their  1996  update  in  Section  2.4.9)  was  compared  to  Marasinghe  (1985)  and 
modifications  to  Gentleman  and  Wilk  (1975)  in  several  outlying  scenarios.  The  results 
were scenario dependent; however, for multiple outliers, Marasinghe’s multi-stage
procedure performed the best. Kianifard and Swallow also note the poor performance
from misspecification of the number of outliers in Marasinghe’s procedure.

2.4.4 Rousseeuw and van Zomeren MVE/LMS Plot

The  highly-referenced  Rousseeuw  and  van  Zomeren  (1990)  methodology  is  based 
on  robust  regression  estimators.  They  suggest  using  the  minimum  volume  ellipsoid 
(MVE)  described  in  Section  2.5.4  as  a  robust  estimate  of  both  the  mean  and  covariance 
matrix to detect outliers in X-space. The robust distance from the Mahalanobis Distance
using the MVE estimates of the mean and covariance matrix can be compared to the
$\chi^2_{p-1,\,0.975}$ cutoff to conclude whether the point is influential in X-space only. The
standardized residuals from a least median of squares fit (see Section 2.5.2) are used to
identify residual outliers. This procedure then classifies an observation into one of four
categories: 1) not an outlier, 2) a residual outlier only, 3) a leverage outlier, or 4) an
outlier in both residual and X-space. The methodology performs well on several
challenging  multiple  outlier  data  sets  where  the  classical  distance  measures  fail.  This 
method  is  available  in  the  S-Plus  software. 

This procedure, as Cook and Hawkins (1990) noted in their discussion of the
paper, suffers from identifying too many outliers. Other authors have similar reservations
about using the MVE (Simpson, 1995, and Woodruff and Rocke, 1994). However, there
have been improvements to the MVE and LMS algorithms recently that increase the
statistical and computational efficiencies of the procedures (Burns, 1992).

2.4.5  Paul  and  Fung  Backward  Selection  Algorithm 

The Paul and Fung (1991) two-phase procedure using generalized extreme
studentized residuals (GESR) tries to minimize the effect of overspecifying j, the
maximum number of outliers, in the Marasinghe (1985) method. The algorithm forms
the set of up to j residual outliers by sequential deletion of the observation with the
highest absolute value of the studentized residual. This residual value must exceed a
Bonferroni critical value. A model is refit without the potential outliers and the largest
studentized residuals are tested again. A similar procedure used in phase two is to search
for outlying values in X-space using Cook’s D. The union of these two sets would be
declared the outliers. Hadi and Simonoff (1993) show through benchmark examples and
simulation that this method suffers from both masking and swamping.
2.4.6 Hadi and Simonoff Forward Selection Algorithm

Hadi  and  Simonoff  (1993)  consider  two  related  procedures  for  the  identification 
of  multiple  outliers  in  regression  models.  Their  procedure  is  based  on  finding  a  “clean” 
subset of n - k + 1 observations that has the minimum residual sum of squared errors.

The algorithm proceeds as:

1. Determine the initial clean subset M of size h = (n + k - 1)/2.

a. Version 1 for determining M is an adaptation of Hadi (1992, 1994) and
related to the elemental sets of Hawkins, Bradu, and Kass (1984).

a.1 Order the n observations by the magnitude of the OLS
adjusted residuals, $a_i = e_i / \sqrt{1 - h_{ii}}$.

a.2 Form the basic subset B by selecting the k + 1 lowest values of
the $|a_i|$.

a.3 Fit an OLS model to set B. Order the scaled residuals defined by

$$sr_i = \frac{e_i}{\sqrt{1 - x_i' (X_B' X_B)^{-1} x_i}}, \quad i \in B$$

$$sr_i = \frac{e_i}{\sqrt{1 + x_i' (X_B' X_B)^{-1} x_i}}, \quad i \notin B.$$

a.4 If s, the size of the basic subset, is equal to h then go to step 2,
else use the first s + 1 observations ordered by the scaled residual as the new basic
subset, B. Go to step a.3.
b. Version 2 for determining the initial clean subset M is based on
Simonoff (1991) using a single linkage clustering algorithm to detect multivariate
outliers.

b.1 Standardize the data by dividing Z = (X : Y) by $\Sigma^{1/2}$. Note
that the authors found the classical estimate of $\Sigma$ superior to the MVE.

b.2 Construct the single linkage clustering tree for all n observations.

b.3 Order clusters from most to least extreme (the more extreme,
the later the cluster joins) and consider cases in smaller clusters as potential
outliers.

b.4 Cluster until there are n - h “extreme” clusters; the remaining h
cases are the clean data for the initial subset M.

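2. Compute the internally studentized residual or scaled prediction error, $d_i$,
where $\hat{\beta}_M$ and $\hat{\sigma}_M$ are the coefficient and scale estimates from the OLS fit to M.

$$d_i = \frac{y_i - x_i' \hat{\beta}_M}{\hat{\sigma}_M \sqrt{1 - x_i' (X_M' X_M)^{-1} x_i}}, \quad i \in M \quad \text{(studentized residual)}$$

$$d_i = \frac{y_i - x_i' \hat{\beta}_M}{\hat{\sigma}_M \sqrt{1 + x_i' (X_M' X_M)^{-1} x_i}}, \quad i \notin M \quad \text{(scaled prediction error)}$$

3. Define s as the size of the current subset. If the (s + 1)th ordered $|d_i|$ exceeds a
critical value of the t distribution, then all
observations with $|d_i|$ exceeding this critical value of the t distribution are outliers.
Otherwise, find a new subset M by using the first s + 1 ordered observations. If s
+ 1 = n, then there are no outliers to consider.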
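
The forward search of Version 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `scaled_parts` and `forward_search` are hypothetical, h is taken as the integer part of (n + k - 1)/2, and a fixed `cutoff` of 3.5 stands in for the t critical value, which in practice depends on the subset size. The toy data are the same kind of constructed example, with four planted high-leverage outliers in positions 26 to 29.

```python
import numpy as np

def scaled_parts(X, y, subset):
    """OLS fit on `subset`: residuals for all cases, their scale factors, and a membership mask."""
    Xs, ys = X[subset], y[subset]
    G = np.linalg.inv(Xs.T @ Xs)
    e = y - X @ (G @ Xs.T @ ys)
    q = np.einsum("ij,jk,ik->i", X, G, X)            # x_i'(X_M'X_M)^{-1} x_i for every case
    inside = np.zeros(len(y), dtype=bool)
    inside[subset] = True
    scale = np.sqrt(np.maximum(np.where(inside, 1.0 - q, 1.0 + q), 1e-12))
    return e, scale, inside

def forward_search(X, y, cutoff=3.5):
    """Grow a clean subset one case at a time; return indices of cases declared outliers."""
    n, k = X.shape
    h = (n + k - 1) // 2
    e, scale, _ = scaled_parts(X, y, np.arange(n))   # a.1: adjusted residuals from the full fit
    order = np.argsort(np.abs(e / scale))
    s = k + 1                                        # a.2: initial basic subset size
    while True:                                      # a.3 and a.4: grow the basic subset to size h
        e, scale, inside = scaled_parts(X, y, order[:s])
        order = np.argsort(np.abs(e / scale))
        if s >= h:
            break
        s += 1
    while True:                                      # steps 2 and 3
        sigma = np.sqrt(np.sum(e[inside] ** 2) / (s - k))
        d = np.abs(e) / (sigma * scale)
        order = np.argsort(d)
        if d[order[s]] >= cutoff:                    # the (s + 1)th ordered |d_i|
            return np.flatnonzero(d >= cutoff)
        s += 1
        if s == n:
            return np.array([], dtype=int)           # no outliers declared
        e, scale, inside = scaled_parts(X, y, order[:s])

# Hypothetical toy data: 26 clean cases near y = 1 + 2x and 4 high-leverage cases 12 units below the line.
x = np.concatenate([np.arange(26) / 5.0, [10.0, 10.2, 10.4, 10.6]])
y = 1.0 + 2.0 * x
y[:26] += 0.05 * np.sin(np.arange(26) ** 2)
y[26:] -= 12.0
X = np.column_stack([np.ones(x.size), x])

found = forward_search(X, y)
```

On this toy set the planted cases never enter the clean subset, so they are among the cases flagged when the search stops. With a fixed cutoff the sketch may also flag some clean cases; the subset-size-dependent critical values of the published procedure are intended to control that.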
Hadi and Simonoff show that their methodology is successful in only two of the
four “benchmark” example sets from the Rousseeuw and Leroy (1987) robust regression
text. Kianifard and Swallow’s (1989) procedure and Marasinghe’s (1985) multi-staged
approach fail on all four. The best procedure was the multi-staged robust regression MM
estimator (see Section 2.5.3). Hadi and Simonoff also conduct a limited (n = 25, p = 2 or 3)
Monte Carlo simulation using similar outlying scenarios to Kianifard and Swallow. The
results suggest Least Median Squares, Reweighted Least Squares and Least Trimmed
Sum of Squares perform poorly for outliers at low-leverage due to their low efficiency.
A high-efficiency MM estimator is sensitive to high-leverage outliers and breaks down
after 3 observations while a lower efficiency MM estimator (70%) is better for the
high-leverage outliers at the expense of significant swamping. These results agree with
Simpson’s (1995) simulations where the only weakness for MM estimators is the
combination of high-leverage and low-dimension. Hadi and Simonoff conclude that their
procedure (Version 1) is preferable to all others based on computational ease, known
cutoff values, and overall performance. Their estimators did not break down nor
excessively swamp in the presence of multiple high-leverage points.

2.4.7  Atkinson  Stalactite  Plot 

Atkinson (1994) uses a computationally attractive alternative to Rousseeuw and
van Zomeren (1990) based on the LMS residuals and MVE. Forward selection
minimizes the probability of including an outlier in the observations in the MVE. The
search is conducted several times at random starting points to find the “global” MVE
subject to no outliers. Simulation is required to scale the LMS residuals to determine
“outlyingness”.  Stalactite  plots  conveniently  show  which  observations  exceed  a  critical 
cutoff  value  from  the  scaled  LMS  residuals  as  a  function  of  subsample  size.  When  the 
size  of  the  subsample  is  equal  to  the  number  of  observations,  the  stalactite  plot  displays 
the  effect  of  masking  and  offers  guidance  in  selecting  appropriate  subsample  sizes  for 
protection  against  masking.  The  procedure  performs  well  in  several  of  the  benchmark 
examples,  but  the  algorithm  is  outdated  for  LMS  and  MVE. 

2.4.8  Pena  and  Yohai  Eigenanalysis 

Pena  and  Yohai  (1995)  describe  a  procedure  to  detect  influential  subsets  in 
regression using eigenanalysis on the influence matrix. The $n \times n$ influence matrix is
defined as the uncentered covariance of a set of vectors which represent the effect on the
fit of the deletion of each data point. Define $t_i = \hat{y} - \hat{y}_{(i)} = \{e_i / (1 - h_{ii})\} h_i$, where $h_i$ is
the $i$th column of the hat matrix. If $T = (t_1 \ldots t_n)$, then the influence matrix $M = T'T / (p s^2)$.
The univariate Cook’s Distance for each observation is on the diagonal of M.
The algorithm is as follows:

1. Form the influence matrix as $M = EDHDE / (p s^2)$ where E is the diagonal matrix
of residuals, D is the diagonal matrix with elements $(1 - h_{ii})^{-1}$,
$H = X(X'X)^{-1}X'$, and $s^2$ is the usual mean square error estimate of the
variance.

2. Computationally, it is better to consider a decomposition of the influence
matrix because only a subset of the eigenvectors is of interest. Define $A = B \Lambda^{-1/2}$
where the columns of B are the eigenvectors of $(X'X)$ and $\Lambda$ is the
diagonal matrix of the associated eigenvalues. If $P = EDXA / (\sqrt{p}\, s)$, the
eigenvectors from the non-null eigenvalues of the influence matrix M are $P v_i$,
where $v_i$ are the eigenvectors from $P'P$.

3.  Find  the  eigenvectors  of  the  p  non-null  eigenvalues  of  the  influence  matrix. 

4. Order the components within each eigenvector, $v_i$, in ascending order to
obtain the order statistics $v_{i(1)} \le v_{i(2)} \le \ldots \le v_{i(n)}$.

5. Search the eigenvectors for observations with large positive or large negative
components. These sets will be considered candidates for outliers. The ratio
$a_j = v_{i(j)} / v_{i(j-1)}$ for $j = n - c_1, \ldots, n$ searches for a breakpoint for the positive
components. Similarly, $b_j = v_{i(j)} / v_{i(j+1)}$ for $j = 1, \ldots, c_2$ finds the breakpoint for
the negative components. The constants $c_1$ and $c_2$ define what percentage of
the total observations should be considered as potential outliers. In practice,
the authors recommend $n/4$ for both values to detect up to 50% outlying
observations. It also makes sense that both of these constants should be equal
since we do not know a priori whether the outliers will load positively or
negatively. In fact, experimentation in identical scenarios shows the outliers
to load inconsistently between replicates.

6. Search for a possible negative and positive breakpoint in the components in
each eigenvector. For a particular eigenvector, select the first $j_0$ such that $|a_j| > k$;
then consider the observations corresponding to the $j_0$th ordered component
up to $n$ as candidate outliers. For the negative values on the components,
select the first $j_0$ such that $|b_j| > k$ and consider the observations corresponding
to the largest negative ordered component up to the $j_0$th as candidate outliers.
The key to this step is selecting $k$, the minimum ratio required to declare
outliers. This parameter is highly significant in determining the tradeoff
between high power in detecting the outliers and high false alarm rate. The
authors recommend 2.5; however, this may lead to too many false alarms in
small samples.

7.  The  last  step  evaluates  the  candidate  outlier  sets  identified  from  step  6  by 
eliminating  the  observations  from  an  OLS  fit  and  evaluating  the  t  tests 
(Bonferroni)  for  each  candidate  outlier  and  outlier  sets. 

The authors’ limited testing of the procedure shows it to perform well in
high-leverage cases, especially with a low amount of contamination. The method also
correctly  identifies  outliers  from  the  challenging  Hawkins,  Bradu,  and  Kass  dataset. 
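
Steps 1 to 3 of the eigenanalysis can be verified numerically. The Python sketch below is illustrative only: the function names `influence_matrix` and `influence_eigenvectors` and the simulated data are assumptions made here, and P is scaled by the square root of p times s so that M equals PP'. The sketch stops before the breakpoint search of steps 4 to 6.

```python
import numpy as np

def influence_matrix(X, y):
    """Influence matrix M = EDHDE / (p s^2), plus the residuals, hat diagonals and s^2."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    s2 = e @ e / (n - p)
    ED = np.diag(e / (1 - h))                 # product of the diagonal matrices E and D
    return ED @ H @ ED / (p * s2), e, h, s2

def influence_eigenvectors(X, y):
    """Non-null eigenvalues of M and matching (unnormalized) eigenvectors via the p x p matrix P'P."""
    n, p = X.shape
    M, e, h, s2 = influence_matrix(X, y)
    lam, B = np.linalg.eigh(X.T @ X)
    A = B / np.sqrt(lam)                      # A = B Lambda^{-1/2}, scaling each column of B
    P = (e / (1 - h))[:, None] * (X @ A) / np.sqrt(p * s2)
    vals, V = np.linalg.eigh(P.T @ P)
    return M, vals, P @ V

# Hypothetical data used only to exercise the functions.
rng = np.random.default_rng(4)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
M, vals, U = influence_eigenvectors(X, y)
```

Working with P'P requires only a p x p eigenproblem rather than an n x n one, which is the computational advantage noted in step 2.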

2.4.9  Swallow  and  Kianifard  Recursive  Residual  Algorithm 

Swallow  and  Kianifard  (1996)  address  the  deficiencies  from  their  1990  recursive 
residual  methodology  for  multiple  outliers.  The  improved  procedure  replaces  classical 
estimates of variance with robust measures: the easily computed interquartile range (IR)
and median absolute deviation from the median (MAD). The IR estimate of the standard
deviation is the 75th percentile minus the 25th percentile of the OLS or recursive residuals. The
MAD estimate of standard deviation is $\mathrm{median}\{|e_i - \mathrm{median}\{e_i\}|\}$ using OLS or
recursive residuals.
The procedure works for both OLS and recursive residuals. In practice, the
authors claim using recursive residuals almost always leads to greater detection power.
The outlier detection algorithm using recursive residuals is as follows:

1. Order the studentized residuals from an OLS fit for all n observations.

2. Use the first p ordered observations for the basis for computing the recursive
residual,

$$w_j = \frac{y_j - x_j' \hat{\beta}_{j-1}}{\left(1 + x_j' (X_{j-1}' X_{j-1})^{-1} x_j\right)^{1/2}}, \quad j = p + 1, \ldots, n,$$

where $\hat{\beta}_{j-1}$ and $X_{j-1}$ are the OLS estimate and design matrix from the first $j - 1$
ordered observations.


3. Compute the robust estimate of scale $\hat{\sigma}$. This is the MAD or IR using the
OLS residuals divided by a correction factor. The correction factor is
determined by finding the mean IR and MAD estimates from a simulation
under the null hypothesis of no outliers with the same number of parameters
and observations as the data being analyzed.

4. Compute the test statistics $|w_j / \hat{\sigma}|$ for each observation.

5.  Compare  the  test  statistics  to  a  critical  value.  Again,  the  critical  value  must 
come  from  simulating  the  given  scenario  under  the  hypothesis  of  no  outliers. 
The  critical  values  are  found  as  the  quantiles  of  the  distribution  of  test 
statistics.  It  turns  out  that  the  critical  values  are  virtually  identical  whether 
using  a  MAD  or  IR  estimate  of  scale.  These  values  are  also  very  similar  to 
the  percentiles  of  the  standard  normal  density;  particularly  as  n  gets  large. 

6. If the test statistic exceeds the critical value, then classify the respective
observation as an outlier. This is the “recursive method”. The authors also
provide a modification that can be described as an inward procedure because
it looks at the maximum $|w_j / \hat{\sigma}|$ to see if the no outliers hypothesis can be
rejected. If it is rejected, then the observation with the largest test statistic is
declared an outlier, it is deleted, and the procedure repeats itself by calculating
the new recursive residuals. This iterates until there are no outliers detected.
This version is extremely computer intensive and does not offer a significant
advantage over the basic recursive method.
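
The recursive residuals and the two robust scale estimates used in the steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names and the simulated data are hypothetical, the cases are ordered by the absolute value of the OLS studentized residual, and the simulation-based correction factor and critical values of steps 3 and 5 are not reproduced.

```python
import numpy as np

def recursive_residuals(X, y, order):
    """Recursive residuals w_j, j = p+1..n, for cases taken in the given order."""
    n, p = X.shape
    Xo, yo = X[order], y[order]
    w = np.empty(n - p)
    for j in range(p, n):
        Xb, yb = Xo[:j], yo[:j]                   # the first j ordered cases form the basis
        G = np.linalg.inv(Xb.T @ Xb)
        beta = G @ Xb.T @ yb
        xj = Xo[j]
        w[j - p] = (yo[j] - xj @ beta) / np.sqrt(1.0 + xj @ G @ xj)
    return w

def mad_scale(r):
    """Median absolute deviation from the median (uncorrected)."""
    return np.median(np.abs(r - np.median(r)))

def iqr_scale(r):
    """Interquartile range: 75th percentile minus 25th percentile (uncorrected)."""
    q75, q25 = np.percentile(r, [75, 25])
    return q75 - q25

# Hypothetical data: order the cases by |studentized residual| from the full OLS fit.
rng = np.random.default_rng(3)
n, p = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
order = np.argsort(np.abs(e / np.sqrt(e @ e / (n - p) * (1 - h))))

w = recursive_residuals(X, y, order)
test_stats = np.abs(w / mad_scale(w))             # step 4, without the correction factor
```

A useful check on any implementation is that the squared recursive residuals sum to the OLS error sum of squares for the full sample.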

Swallow and Kianifard run simulations with the same seven outlying scenarios as
their previous work (1990) and those of Hadi and Simonoff (1993), which are limited
because only a single regressor with n = 25 is used. The results show insensitivity to the
IR or MAD estimate of $\sigma$, no significant swamping in any scenario by any method,
moderately higher power in detecting outliers from recursive residuals over OLS
studentized residuals, and the usefulness of robust estimators over OLS in masking
scenarios. No method performs well until the outlying distance is at least $4\sigma$. The
authors’ claim of simplicity as the primary advantage of their methodology is questionable
given the number of simulations required to obtain distributional properties.

2.4.10 Sebert, Montgomery, and Rollier Clustering Algorithm

Sebert  et  al.  (1998)  suggest  an  approach  for  identifying  a  reasonable  candidate 
subset of multiple outliers that avoids the complexities associated with most competing
procedures. The methodology clusters observations from an easily formed projection of
the data into two independent dimensions. Specifically, Sebert et al. suggest the
following steps:

1. Standardize the predicted and residual values from an OLS fit.

2.  Cluster  these  observations  using  Euclidean  distance  and  a  single  linkage 
clustering  algorithm. 

3. Form clusters based on tree height (ch, a measure of closeness) using
Mojena’s stopping rule ($ch = \bar{h} + 1.25 s_h$, where $\bar{h}$ is the average height of the
tree and $s_h$ is the sample standard deviation of heights). Note that tree height
is a measure of cluster separation.

4.  The  single  largest  cluster  is  the  clean  data  while  the  remaining  subsets  are  all 
candidates  for  outliers. 

5.  Assess  the  influence  of  the  candidate  observations  using  multiple  row 
diagnostics. 
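
Steps 1 to 4 can be sketched in Python without a clustering library, because the merge heights of a single linkage tree are the edge lengths of the minimum spanning tree of the points. The sketch below is an illustration under stated assumptions: the function names, the use of Prim's algorithm, and the toy data are choices made here, not the authors' implementation, and step 5 (multiple row diagnostics) is omitted.

```python
import numpy as np

def single_linkage_edges(points):
    """Minimum spanning tree edges (a, b, length); the lengths are the single linkage merge heights."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best, parent, edges = dist[0].copy(), np.zeros(n, dtype=int), []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))   # closest case not yet in the tree
        edges.append((int(parent[j]), j, float(best[j])))
        in_tree[j] = True
        closer = dist[j] < best
        parent[closer] = j
        best = np.where(closer, dist[j], best)
    return edges

def cluster_candidates(X, y, mult=1.25):
    """Cases outside the largest single linkage cluster of standardized (fitted, residual) pairs."""
    fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - fitted
    z = np.column_stack([(fitted - fitted.mean()) / fitted.std(ddof=1),
                         (resid - resid.mean()) / resid.std(ddof=1)])
    edges = single_linkage_edges(z)
    heights = np.array([w for _, _, w in edges])
    cut = heights.mean() + mult * heights.std(ddof=1)          # Mojena's stopping rule
    label = np.arange(len(y))
    def find(a):
        while label[a] != a:
            a = label[a]
        return a
    for a, b, w in edges:                                       # keep only merges below the cut height
        if w <= cut:
            label[find(a)] = find(b)
    roots = np.array([find(i) for i in range(len(y))])
    values, counts = np.unique(roots, return_counts=True)
    return np.flatnonzero(roots != values[np.argmax(counts)])

# Hypothetical toy data: 26 clean cases near y = 1 + 2x and 4 high-leverage cases 12 units below the line.
x = np.concatenate([np.arange(26) / 5.0, [10.0, 10.2, 10.4, 10.6]])
y = 1.0 + 2.0 * x
y[:26] += 0.05 * np.sin(np.arange(26) ** 2)
y[26:] -= 12.0
X = np.column_stack([np.ones(x.size), x])

candidates = cluster_candidates(X, y)
```

On this toy set the planted cases form their own cluster in the standardized (predicted, residual) plane even though the OLS fit is distorted by them, so they are returned as the candidate subset.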

Simulated  regression  data  sets  demonstrate  the  success  of  the  methodology.  The 
procedure  is  generally  very  powerful  at  detecting  outliers  and  performs  well  on  the 
classic  challenging  data  sets.  Other  significant  simulation  results  include: 

• The correct observations are identified as outliers increasingly better as
outlying distance, number of observations, number of regressors, and percentage
of outliers increase. The last two are counter to what most published results
show.

• Performance is worst when there is one outlying group along the
regression line and another group at the same location in X space, but with
significantly larger residual values.

• The number of clean observations classified as outliers (false alarms)
decreases as the number of regressors increases. This false alarm rate increases as
percentage of outliers and number of observations increase.

•  The  null  case  is  a  limitation  to  the  methodology.  When  there  are  no 
outliers,  then  approximately  20%  of  the  observations  are  identified  as  candidates 
for  outliers. 

2.4.11 Lee and Fung Forward Selection Algorithm

Lee  and  Fung  (1997)  propose  a  stepwise  algorithm  to  detect  multiple  outliers  in 
generalized  linear  models  (GLIMs)  and  nonlinear  regression  based  on  a  high  breakdown 
robust  estimator.  They  determine  the  clean  data  set  from  the  studentized  residual  (GLIM 
raw  residual  over  standard  error)  from  a  robust  fit  and  sequentially  add  some  of  the  initial 
outliers  back  to  the  clean  set  since  too  many  outliers  are  identified.  Outliers  are  added 
back  by  determining  the  upper  5%  bound  on  the  studentized  residuals  via  Monte  Carlo 
simulation.  This  procedure  iterates  until  no  observations  exceed  the  5%  upper  bound. 
There  were  no  problems  encountered  in  the  selected  examples,  but  further  simulation  is 
required  to  accurately  assess  finite  sample  performance. 
2.4.12 Luceno Reweighted Least Deviances Algorithm

Luceno (1998) discusses using the weights from a reweighted least squares
procedure to detect multiple outliers in the GLIM. The mean of the deviances (sum of
squared deviance residuals) is replaced by a weighted mean of deviances. The weights
are calculated with a Huber or redescending function. The parameter estimates come
from minimization of the quantity $n^{-1} \sum_{i=1}^{n} w_i D_i(\mu_i; \phi; y_i)$. $D_i$ is the squared deviance
residual for the $i$th observation, $\mu$ is the mean ($X\beta$ in normal theory), $\phi$ is the nuisance
parameter ($\sigma$ in normal theory models), and $w_i$ is the weight from the influence function.
If weights from Huber’s function are used, then $w_i = 1.5 / |D_i^{1/2}|$ if $|D_i^{1/2}| > 1.5$, otherwise
$w_i = 1.0$. The procedure avoids estimating $\sigma$ (or the appropriate nuisance parameter) by
assuming detection of outliers is insensitive to $\sigma$ within a certain range. Outliers are
considered observations with unusually low values for the weights. Luceno suggests
direct minimization of the objective function is computationally reasonable (when
compared to LTS or LMS) and should be done on random subsets to avoid local
minima.

The  procedure  successfully  detects  outliers  in  several  examples  from  McCullagh 
and  Nelder  (1989)  and  also  identifies  the  outliers  in  the  stackloss  data  set.  The  method 
appears  to  be  effective  at  detecting  leverage  outliers.  Performance  apart  from  4  examples 
is not reported.

2.5 Robust Regression

Either the residual or the observation’s final weight from a robust estimator can
be used to identify multiple outliers in regression. Robust regression accommodates
outliers  by  judiciously  downweighting  them  through  the  selection  of  model  and  input 
parameters.  We  also  consider  robust  regression  estimators  beyond  the  purpose  of  outlier 
identification  in  this  research.  The  literature  on  robust  regression  is  vast  and  what 
follows  is  only  a  portion  that  is  most  directly  applicable  to  this  research. 

2.5.1  Properties  of  Robust  Regression  Estimators 

The  three  most  important  properties  for  robust  regression  estimators  are 
breakdown,  efficiency,  and  bounded-influence.  The  concept  of  breakdown  is  the  primary 
motivation  for  using  robust  regression  over  OLS.  The  breakdown  point  is  defined  as  the 
smallest  fraction  of  anomalous  data  that  can  render  the  estimator  useless.  As  displayed  in 
Figure 1.1, a single outlying point can significantly change the OLS estimates of $\beta$; the
breakdown point is $1/n$, or 0% because $n$ can be made arbitrarily large. Robust estimators
can  have  breakdown  points  as  high  as  50%. 

Another  desirable  property  for  robust  estimators  is  efficiency.  The  efficiency  is 
defined  as  the  performance  of  the  robust  estimator  relative  to  OLS  under  the  assumption 
of no outliers; $\varepsilon$ is NID$(0, \sigma^2 I)$. Recall that the OLS estimate will be the minimum
variance estimate among all unbiased estimators. Typically, efficiency is expressed as
the ratio of mean square errors.

The third desirable property is bounded-influence in X-space. This is the
estimator’s resistance to being “pulled” toward the extreme observations in X-space.
Least squares is not bounded-influence and the more remote observations exert greater
influence on the parameter estimates.

2.5.2  High-Breakdown  Point  Estimators 

High-breakdown point (HBP) regression estimators have been developed to
provide reliable estimates in the presence of a large percentage of outlying observations.
These estimators can achieve up to a 50% breakdown point and are also known as resistant
estimators. They are useful for outlier detection and initial estimators, but their low
efficiency and unbounded influence detract from their use as stand-alone estimators.

Least Median of Squares (LMS) Estimators. Rousseeuw (1984) introduced the
high-breakdown (as much as 50%) LMS estimators. LMS is obtained by minimizing the
$h$th ordered squared residual where $h$ is defined as the integer portions of
$n/2 + (p+1)/2$, that is, $h = [n/2] + [(p+1)/2]$. The objective function can be
expressed as $\min_{\beta} \operatorname{median}(e_i^2)$. LMS fits just over half the
data and minimizes the residual for a single observation. The original proposal solved
the objective function with random resampling; however, improved algorithms now exist
(Burns, 1992 and Atkinson, 1994). The primary unattractive characteristic of LMS is an
asymptotic efficiency of 0%. Although useful if severe contamination is suspected or
when used in conjunction with other techniques, LMS has grown out of favor. Ryan
(1997) argues against LMS because its algorithm is unstable (computationally intense and
possibly yielding different solutions from the same data) and because small changes in
the data result in large changes in parameter estimates. He concludes LMS should not be
used as a stand-alone estimator, an initial estimator or an outlier detection estimator.


Least Trimmed Sum of Squares (LTS) Estimators. Rousseeuw (1984, 1985)
proposed the LTS high breakdown estimator as an efficient alternative to LMS. The LTS
estimator is formed by minimizing the sum of the first $h$ out of $n$ squared residuals
ordered from smallest to largest. Rousseeuw and Leroy (1987) recommend
$h = n(1-\alpha) + 1$ where $\alpha$ is the trimmed percentage. The objective function is
$\min_{\beta} \sum_{i=1}^{h} (e^2)_{i:n}$ and it is solved with either
random resampling (Rousseeuw and Leroy, 1987), a genetic algorithm (Burns, 1992) or
forward search (Woodruff and Rocke, 1994). This estimator is attractive because $\alpha$ can
be selected to prevent some of the poor results other 50% breakdown estimators show.
LTS can be fairly efficient if the number of trimmed observations is close to the number
of outliers because OLS is used to estimate parameters from the remaining $h$
observations. The LTS estimator can become computationally intense as the number of
observations increases.
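
The LMS and LTS criteria described above can be illustrated with a small random-resampling search. The following is only a sketch under stated assumptions: the function names, the simulated data, the number of elemental subsets, and the trimming proportion of 0.35 are illustrative choices and are not taken from the cited papers. Here $p$ is taken to be the number of columns of the design matrix, and no final OLS refit on the retained observations is performed.

```python
import numpy as np


def lms_criterion(sq_resid, h):
    """h-th smallest squared residual (the LMS objective)."""
    return np.sort(sq_resid)[h - 1]


def lts_criterion(sq_resid, h):
    """Sum of the h smallest squared residuals (the LTS objective)."""
    return np.sort(sq_resid)[:h].sum()


def resampling_fit(X, y, criterion, h, n_subsets=1000, seed=0):
    """Approximate a high-breakdown fit by random elemental subsets.

    X is the n x p design matrix (intercept column included). Each trial
    fits p randomly chosen observations exactly and scores that fit on all n.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_value = None, np.inf
    for _ in range(n_subsets):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])
        except np.linalg.LinAlgError:
            continue  # singular elemental subset, draw again
        value = criterion((y - X @ beta) ** 2, h)
        if value < best_value:
            best_beta, best_value = beta, value
    return best_beta


# Illustrative data: 28 clean points near y = 2 + 3x and 12 bad leverage points.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 40)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=40)
y[28:] -= 40.0
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

h_lms = n // 2 + (p + 1) // 2        # integer parts of n/2 and (p+1)/2
h_lts = int(n * (1 - 0.35)) + 1      # assumed trimming proportion alpha = 0.35

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_lms = resampling_fit(X, y, lms_criterion, h_lms)
beta_lts = resampling_fit(X, y, lts_criterion, h_lts)
```

With 30% contamination at remote x-values, the OLS slope is pulled well away from 3 while both resampling fits should stay close to it. A different seed can give a slightly different solution, which is the instability noted for resampling algorithms.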


S-Estimators. Rousseeuw and Yohai (1984) develop a high breakdown estimator
(as much as 50%) that minimizes the dispersion of the residuals. The objective function
is $\min_{\beta} s(e_1(\beta), \ldots, e_n(\beta))$ where $e_i(\beta)$ is the $i$th residual
for candidate $\beta$. This objective function is given by the solution to

$$(n-p)^{-1} \sum_{i=1}^{n} \rho\left(\frac{e_i}{s}\right) = K$$

where $K$ is a constant $E_{\Phi}[\rho]$ with $\Phi$ defined as the standard normal.
Rousseeuw and Yohai (1984) suggest a redescending influence function as

$$\rho(x) = \frac{x^2}{2} - \frac{x^4}{2c^2} + \frac{x^6}{6c^4} \quad \text{if } |x| \le c, \qquad \text{otherwise } \rho(x) = \frac{c^2}{6}.$$

The parameter $c$ is the tuning constant. Tradeoffs in breakdown and
efficiency are possible based on choices for the tuning constant $c$ and $K$. The usual
choice is $c = 1.548$ and $K = 0.1995$ for 50% breakdown and about 28% asymptotic
efficiency (Rousseeuw and Leroy, 1987).

The final scale estimate, $s$, is the standard deviation of the residuals from the fit
that minimized the dispersion of the residuals. The scale estimate is an implicitly derived
M-estimate of scale. Ruppert (1992) suggests an improved resampling algorithm and
concludes that S-estimators perform marginally better than LMS and LTS.
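
Because the S-scale is defined only implicitly, a short numerical sketch may help. The code below solves the scale equation for a fixed set of residuals using the biweight $\rho$ with $c = 1.548$ and $K = 0.1995$ from the text. The fixed-point update, the MAD starting value, and the function names are assumptions made for illustration and are not the algorithm of Rousseeuw and Yohai or Ruppert.

```python
import math
from statistics import NormalDist, median

C_TUNE = 1.548   # tuning constant c from the text
K_CONST = 0.1995  # constant K from the text


def rho_biweight(x, c=C_TUNE):
    """Biweight rho: x^2/2 - x^4/(2c^2) + x^6/(6c^4) for |x| <= c, else c^2/6."""
    if abs(x) <= c:
        return x**2 / 2 - x**4 / (2 * c**2) + x**6 / (6 * c**4)
    return c**2 / 6


def s_scale(residuals, p=0, tol=1e-10, max_iter=500):
    """Solve (n - p)^(-1) * sum(rho(e_i / s)) = K for s by fixed-point iteration."""
    n = len(residuals)
    med = median(residuals)
    s = median(abs(e - med) for e in residuals) / 0.6745  # assumed MAD start
    for _ in range(max_iter):
        g = sum(rho_biweight(e / s) for e in residuals) / (n - p)
        s_new = s * math.sqrt(g / K_CONST)
        if abs(s_new - s) <= tol * s:
            return s_new
        s = s_new
    return s


# Deterministic "sample": normal quantiles scaled to standard deviation 2.
residuals = [2.0 * NormalDist().inv_cdf((i + 0.5) / 2000) for i in range(2000)]
s_hat = s_scale(residuals)
```

For clean normal residuals the solution should be close to the true standard deviation, which is the role of the constant $K = E_{\Phi}[\rho]$. In a full S-estimator this scale would be recomputed for each candidate $\beta$ and the candidate with the smallest scale retained.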


2.5.3  M-Estimators  and  Multi-Stage  Procedures 

M-Estimators. M-estimators are maximum likelihood robust estimators proposed
by Hampel (1973) that are nearly as efficient as OLS. Rather than minimize the sum of
squared errors as the objective, the M-estimate minimizes a function $\rho$ of the errors. The
M-estimate objective function is

$$\min_{\beta} \sum_{i=1}^{n} \rho\left(\frac{y_i - x_i'\beta}{s}\right)$$

where $s$ is an estimate of scale often formed from a linear combination of the residuals.
The system of normal equations to solve this minimization problem is found by taking
partial derivatives with respect to $\beta$ and setting them equal to 0, yielding

$$\sum_{i=1}^{n} \psi\left(\frac{y_i - x_i'\beta}{s}\right) x_i = 0$$

where $\psi$ is the derivative of $\rho$.

The choice of the $\psi$-function is based on the preference of how much weight to
assign outliers (see e.g. Montgomery and Peck, 1992). A monotone $\psi$-function does not
weight large outliers as much as least squares (e.g. a $10\sigma$ outlier would receive the same
weight as a $3\sigma$ outlier). A redescending $\psi$-function increases the weight assigned to an
outlier until a specified distance (e.g. $3\sigma$) and then decreases the weight to 0 as the
outlying distance gets larger.

Newton-Raphson and Iteratively Reweighted Least Squares (IRLS) are the two
methods to solve the M-estimate’s nonlinear normal equations. IRLS is the most widely
used in practice and the only one considered for this research. IRLS expresses the normal
equations as $X'WX\beta = X'Wy$ where $W$ is an $n \times n$ diagonal matrix of weights

$$w_i = \frac{\psi\left[(y_i - x_i'\hat{\beta}_0)/s\right]}{(y_i - x_i'\hat{\beta}_0)/s}.$$

The initial vector of parameter estimates, $\hat{\beta}_0$, is typically
obtained from OLS or a high-breakdown point estimator. IRLS updates these parameter
estimates with $\hat{\beta}_1 = (X'WX)^{-1}X'Wy$. The procedure continues until some convergence
criterion is satisfied. The estimate of scale may be updated after the initial estimate.
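
The IRLS iteration above can be sketched in a few lines. This is an illustrative sketch only: it assumes Huber's monotone $\psi$ with tuning constant 1.345, a MAD-based scale recomputed at every iteration, and an OLS start. None of these choices is prescribed by the text, and the function names are hypothetical.

```python
import numpy as np


def huber_weights(u, k=1.345):
    """w(u) = psi(u)/u for Huber's psi: 1 if |u| <= k, k/|u| otherwise."""
    a = np.abs(u)
    return np.where(a <= k, 1.0, k / np.maximum(a, k))


def irls_huber(X, y, beta0=None, k=1.345, tol=1e-8, max_iter=100):
    """Solve the M-estimate normal equations X'WX beta = X'Wy by IRLS."""
    beta = (np.linalg.lstsq(X, y, rcond=None)[0] if beta0 is None
            else np.asarray(beta0, dtype=float))
    for _ in range(max_iter):
        r = y - X @ beta
        # Assumed scale estimate: normalized median absolute deviation.
        s = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-12)
        w = huber_weights(r / s, k)
        XtW = X.T * w                      # X'W without forming the n x n matrix
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta


# Illustrative data: y = 1 + 2x with three large residual outliers in mid-range x.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=50)
y[[23, 25, 27]] += 30.0
X = np.column_stack([np.ones_like(x), x])

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_m = irls_huber(X, y)
```

The three outliers shift the OLS intercept noticeably, while the M-estimate downweights them. Because the outliers here are not remote in X-space, this example does not exercise the leverage problem that motivates the GM-estimators discussed next.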


Generalized M-Estimators. The Generalized M-estimators (GM), proposed by
Mallows (1975) and improved by Krasker and Welsch (1982), were developed to
overcome the limitations of M-estimators for high-leverage observations. The GM-estimator
bounds the influence in X-space by weighting the M-estimate system of normal
equations by a measure of leverage. The GM system of normal equations is

$$\sum_{i=1}^{n} \pi_i \, \psi\left(\frac{y_i - x_i'\beta}{s\,\pi_i}\right) x_i = 0$$

where $\pi_i$ is a measure of remoteness in X-space. When the $\pi$-weights are located both
inside and outside the argument of the $\psi$-function, the GM
objective function is Schweppe (Handschin et al., 1975). If the $\pi$-weights are not inside the
argument, the GM objective function is Mallows (Mallows, 1975). In practice, the
distinction between the two objective functions is that Mallows will downweight high-
leverage points independently of the residual value while Schweppe will not downweight
if the response value is in line with the regression plane. Thus, Mallows does not fully
incorporate “good outliers” in the parameter estimates. There are several approaches to
forming the $\pi$-weights that use some form of the leverage measures discussed in Section
2.5.4.

A  numerical  optimization  scheme  is  required  to  solve  the  GM  system  of  nonlinear 
normal equations. The two most common approaches are also Newton’s method and
IRLS as in M-estimation. The initial parameter estimates are most often from one of the
HBP  estimators  in  Section  2.5.2.  The  final  parameter  estimates  can  come  from  a  fully 
iterated  solution  (GM-estimator)  or  only  a  single  iteration  (compound  estimator).  The 
single  iteration  method  preserves  the  breakdown  of  the  initial  estimator  (Simpson, 
Ruppert,  and  Carroll,  1992). 


MM-Estimators. Yohai (1987) and Yohai et al. (1991) introduce MM estimators
that achieve the high-efficiency of M-estimators and are also high-breakdown. The first
stage of the three stage procedure calculates an S-estimate with influence function
$\rho(x) = 3(x/c)^2 - 3(x/c)^4 + (x/c)^6$ if $|x| \le c$; otherwise $\rho(x) = 1$. The value
of the tuning constant, $c$, is selected as 1.548. The second stage calculates the MM
parameters that provide the minimum value of

$$\sum_{i=1}^{n} \rho\left(\frac{y_i - x_i'\beta}{s_0}\right)$$

where $\rho(x)$ is the influence function used in the first stage with tuning constant 4.687
and $s_0$ is the estimate of scale from the first step (standard deviation of the residuals).
The final step computes the MM estimate of scale as the solution to

$$(n-p)^{-1} \sum_{i=1}^{n} \rho\left(\frac{y_i - x_i'\hat{\beta}}{s}\right) = K$$

with $K$ the constant that fixes the breakdown point, as in the S-estimator scale equation
of Section 2.5.2.

The MM estimator generally performs well except in areas with high-leverage
(Simpson and Montgomery, 1998b). S-Plus version 4.5 has included the MM estimator
with the Yohai et al. (1991) test for bias on the robust regression menu.

2.5.4  Leverage  Measures  in  Robust  Regression 

One objective of this research is to improve multi-staged GM and compound
estimators. Another factor to improve, other than the HBP initial estimator, is the
$\pi$-weights that measure the remoteness in X-space. Some other measures of leverage
beyond the hat diagonals and the Mahalanobis Distance used in robust regression are the
M-estimates of covariance, the minimum volume ellipsoid (MVE), and the minimum
covariance determinant (MCD).

M-Estimates of Covariance. Hampel (1973) first suggested M-estimates of
covariance, but the basic paper on these estimators is attributed to Maronna (1976).
Maronna addressed the problems of existence, uniqueness, asymptotic distribution and
breakdown point for these estimators. We are interested in the distances in X-space for
each observation defined by $z = A(x - t)$ where $A$ is an estimate of the $p \times p$ multivariate
scatter matrix and $t$ the multivariate location vector. Note that $(A'A)^{-1}$ is the estimate of
the covariance matrix of X. From Huber (1981), the maximum likelihood estimate of $A$
and $t$ is determined by solving the simultaneous equations

$$\text{ave}\{w(|z|)\,z\} = 0$$
$$\text{ave}\{u(|z|)\,zz' - v(|z|)\,I_p\} = 0$$

where $u$, $v$ and $w$ are arbitrary weight functions and ave{·} is the average taken over the
sample. We solve these equations using the Newton algorithm and Huber weight
functions with the associated constants and correction factors as defined in the ROBETH
library (Marazzi, 1993).

The following steps summarize the ROBETH library implementation of the
M-estimates of covariance procedure to compute robust distances for the $n \times p$ matrix X with
elements $x_{ij}$ for $i = 1$ to $n$ and $j = 1$ to $p$.

1. Find initial estimates of $A$ and $t$. $t_j = \text{med}\{x_{ij}\}$ and $A$ is a diagonal matrix with
diagonal elements $a_{jj} = 1 / \left(\text{med}\{|x_{ij} - \text{med}(x_{ij})|\} / 0.6745\right)$.

2. Find the constant parameters (a, b, c, d) of the arbitrary weight functions $u(z)$
(Huber’s weight function), $v(z)$, and $w(z)$ by first specifying the expected proportion
of outliers in the sample, $\varepsilon$.

$$u(z) = \begin{cases} a^2/z^2 & \text{for } z^2 < a^2 \\ 1 & \text{for } a^2 \le z^2 \le b^2 \\ b^2/z^2 & \text{for } z^2 > b^2 \end{cases}$$

$a^2 = \max(p - \kappa, 0)$ and $b^2 = p + \kappa$. The value for $\kappa$ comes from regula-falsi solution to


^dr+^e  '^'^r'’~^dr+  jf— 1  r’’  'dr 

a  ^ 


where $k_p$ denotes the surface of the unit sphere in dimension $p$; $k_p = 2\pi^{p/2}/\Gamma(p/2)$.


v(z)  =  d  for  all  z 

d  =  1  +  6"(l  -  jj(6^))}+ 

$$w(z) = \begin{cases} 1 & \text{for } z \le c \\ c/z & \text{for } z > c \end{cases}$$

A  Newton  procedure  solves  for  c  in  e~‘  !  y/ldr  +  c(0(c)  -  (1  —  f  /  2)  /(I  -  f ))  =  0 


^  A 

3. Calculate $z_i = A(x_i - t)$ for $i = 1$ to $n$

0  =  2  1  to/' 

/ 

•^2  =  2  I)  +  ^'(1  I)  I  I  ^  -P} 

/ 


4.  Compute  a  lower  triangular  matrix  of  improvements  S  =  (sjk) 
Si/= — — — {a.! -{b -c)e -d\  j=\\op 

^  l{a+by^  ’ 


^/t  = — ^-==-Oik 
^  (fl+fe)  ^ 

^jk  “t) 


j>k 
j  <  k 


where  d  =  z*  |)  |z,.  f;b=  (ri{p  +  2)'‘)2«'(l  1)  I  P  ; 


c  =  w'‘2]v’(|  z^\)\z,\-,d=  n'*  J]v(|  Z;  |) ;  and  e  = 


dp-d 


2{d  +  b)  +  p{b  -c) 


5.  Update  the  location  estimate,  ij  =  tj  +  hj  where  hj  =  Vj  /Sj  from  step  3. 

A  A  A 

6.  Update  the  scatter  matrix,  A  =  (I  -  where  Ag  is  the  initial  (or  current) 
estimate  of  the  scatter  matrix  and  y  is  the  step  length  based  on  the  maximum  value  of 


7. Check for convergence on the location and scatter matrix estimates and go to step 3 if
improvement is still possible based on the specified tolerance values. Otherwise
calculate the distances from the M-estimates of covariance with the current location
and scatter estimates.

Minimum  Volume  Ellipsoid  (MVE).  Rousseeuw  (1985)  proposes  the  MVE  as  a 
high-breakdown  estimate  for  the  mean  and  covariance  matrix.  The  MVE  is  the  smallest 
ellipsoid covering just over half of the data. The robust estimates of the mean and
covariance  matrix  come  from  the  classical  calculation  of  these  quantities  only  using  the 
subset  of  observations  that  are  contained  in  the  MVE.  The  original  algorithm  uses 
random  resampling  to  find  the  subset  of  observations  that  is  covered  with  the  smallest 
volume  ellipsoid.  The  algorithm  is: 

1. Form a random sample of size $q = p + 1$ from the $n$ observations.

2. Compute the classical mean and covariance matrix for this sample of size $q$.

3. Compute the Mahalanobis distance for all $n$ observations with the estimators
in (2).

4. Increase the sample to size $n/2 + 1$ by adding the $n/2 + 1 - q$ observations with
the least Mahalanobis distance from (3) and compute the mean and covariance
matrix.

5.  Compute  the  product  of  the  median  Mahalanobis  distance  and  the  covariance 
matrix  from  (4). 

6.  The  determinant  of  the  quantity  in  (5)  is  proportional  to  the  volume  of  the 
ellipsoid  covering  these  observations. 

7. Iterate steps 1 through 6 for the specified number of random samples to
evaluate.

8. Compute the mean vector and covariance matrix for the sample that yields the
minimum value for the quantity in 6.

9. Correct the covariance matrix for small sample sizes (Rousseeuw and van
Zomeren, 1991) and consistency at multivariate normal distributions with the
quantity $(1 + 15/(n-p))^2 / \chi^2_{p,0.50}$.

Hawkins (1993) improved the algorithm using steepest descent with random
restarts rather than the random sampling method. Woodruff and Rocke (1993) propose a
heuristic search optimization procedure. Currently S-Plus 4.5 uses genetic algorithms
(Burns, 1992).

Unfortunately, the MVE is inefficient with asymptotic efficiency of 0 (Davies,
1992). The implementation in S-Plus goes an additional step to increase efficiency. All
remaining observations apart from those in the MVE are added back to compute the final
estimate of the mean and covariance matrix if their Mahalanobis distances (calculated
with the original MVE estimates) are less than a cutoff value from the chi-square
distribution. This significantly increases the efficiency of the estimates. An alternative
estimator is the Minimum Covariance Determinant (MCD) that was also introduced by
Rousseeuw (1985). The MVE is an $n^{1/3}$-consistent estimator and the MCD is
$n^{1/2}$-consistent (Butler et al., 1993).
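
The random-resampling MVE steps listed above can be sketched as follows. This is a hedged illustration, not the S-Plus or ROBETH implementation: the function names, the simulated data, and the number of random samples are assumptions. Step 5 is read here as recomputing squared distances with the step 4 estimates and taking their median over all observations, and the small-sample correction of step 9 is omitted.

```python
import numpy as np


def mahalanobis_sq(X, center, cov):
    """Squared Mahalanobis distance of each row of X from center."""
    d = X - center
    return np.einsum("ij,jk,ik->i", d, np.linalg.inv(cov), d)


def mve_resampling(X, n_samples=300, seed=0):
    """Approximate MVE location and scatter by random resampling (steps 1 to 8)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    q, half = p + 1, n // 2 + 1
    best_vol, best_center, best_cov = np.inf, None, None
    for _ in range(n_samples):
        idx = rng.choice(n, size=q, replace=False)               # step 1
        c0 = X[idx].mean(axis=0)                                 # step 2
        S0 = np.cov(X[idx], rowvar=False)
        if np.linalg.matrix_rank(S0) < p:
            continue                                             # degenerate sample
        order = np.argsort(mahalanobis_sq(X, c0, S0))            # step 3
        others = order[~np.isin(order, idx)]
        keep = np.concatenate([idx, others[: half - q]])         # step 4
        c1 = X[keep].mean(axis=0)
        S1 = np.cov(X[keep], rowvar=False)
        m2 = np.median(mahalanobis_sq(X, c1, S1))                # step 5 (assumed reading)
        vol = np.linalg.det(m2 * S1)                             # step 6
        if vol < best_vol:                                       # steps 7 and 8
            best_vol, best_center, best_cov = vol, c1, S1
    return best_center, best_cov


# Illustrative data: 30 clean bivariate points and a cluster of 8 remote points.
rng = np.random.default_rng(3)
clean = rng.normal(size=(30, 2))
remote = 20.0 + 0.5 * rng.normal(size=(8, 2))
X = np.vstack([clean, remote])

center, cov = mve_resampling(X)
robust_dist_sq = mahalanobis_sq(X, center, cov)
```

The robust distances of the remote cluster should be far larger than those of the clean points, which is what makes such distances usable as leverage measures in the $\pi$-weights.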


Minimum Covariance Determinant (MCD). The MCD searches for the sample of
size $q < n$ that has the minimum value, among all samples evaluated, of the determinant
of its covariance matrix. The estimator is most often a 50% breakdown estimator so $q$ is
set to the integer part of $(n + p + 1)/2$. The algorithm has evolved from random
resampling much like the MVE. The improvements proposed by Hawkins (1994),
Woodruff and Rocke (1994) and the genetic algorithms (Burns, 1992) parallel those of
the MVE. Butler et al. (1993) prove that the MCD has much better statistical properties,
notably efficiency, than the MVE. Several other authors recommend the MCD over the
MVE (Simpson and Chang, 1997, simulations in Rocke and Woodruff, 1994); however,
Rocke and Woodruff (1997) do not recommend either as stand-alone procedures because
of computational complexity in high-dimension. It is not known how the genetic
algorithms perform as stand-alone procedures. Rocke and Woodruff (1997) recommend
their hybrid procedure from 1996.

Rocke and Woodruff Hybrid Procedure. Rocke and Woodruff (1996) propose a
complex algorithm to detect outliers from multivariate normal samples large in both
dimension and the number of observations. Their procedure combines several results
from the literature to form a hybrid robust estimator of location and scale with attractive
properties. This estimate is then used to compute the robust distance only for each
observation and does not consider regression data and residuals. The estimator has up to
a 40% breakdown point compared to the usual breakdown of $1/(1+p)$ for robust
estimators. Additionally, the estimator is affine equivariant so linear transformations on
the data will not affect the performance. This measure of leverage has not been used in a
GM or compound estimator.

Rocke and Woodruff use a two-phase approach. The output of the first phase is
an estimate of multivariate location and shape. The first step is to equally partition the
data into cells to minimize computational burden. Within each cell, the observations
from the minimum covariance determinant (MCD) using Hawkins (1993) steepest
descent algorithm with random restarts are the starting point for the sequential point
addition algorithm from Hadi (1992). This result is then used as a starting point for the
translated bi-weight M-estimation of the mean and covariance matrices. Rocke and
Woodruff (1993) use their previously published simulation results to justify using the
constrained M-estimator over the bi-weight S-estimator. The robust covariance and
location matrices are found by using the estimators from the cell with the minimum
determinant of the sample covariance matrix.

The second phase runs a simulation to determine the appropriate cutoff value to
classify observations as outliers based on $n$ observations in $p$ dimensions using clean
multivariate normal data in the Phase I algorithm. New location and shape matrices are
formed by those observations below the simulated cutoff value. The robust distance is
calculated using these new location and shape matrices and compared to a $\chi^2_{p,1-\alpha}$ critical
value to classify the observation as outlying or not.

Their results show no problems with swamping when no outliers are present
based on simulations with 10 to 40 variables and sample sizes from 50 to 3200. The
proposed hybrid estimator significantly outperformed Rousseeuw’s (1985) random search
over elemental subsets and marginally outperformed the forward search of Hadi (1992).
Their algorithm worked well on the smaller published “challenging” sets. Other
significant results for their procedure include:

•  Identification  of  outliers  is  easier  if  the  outliers  lie  in  more  than  one  cluster. 

•  In  higher  dimensions,  outlier  detection  is  more  difficult,  more  data  is  required, 
and  the  breakdown  is  lower. 

•  Increasing  sample  size  increases  the  probability  of  correctly  identifying 
outliers. 

•  For  reasonable  computation  time,  breakdown  is  roughly  30-40%  in  dimension 
10, 25-35%  in  dimension  20,  and  20-25%  in  dimension  40. 

2.6  Variable  Selection  Procedures 

An important aspect of building a regression model is to decide which regressor
variables should be included in the model. The $\beta$ vector is partitioned into an active
variable set, $\beta_1$, and an inactive set, $\beta_2$, to test the hypothesis that

$H_0$: $\beta_2 = 0$
$H_a$: $\beta_2 \ne 0$.

Failure to reject the null hypothesis suggests there is no evidence that any of the regressor
variables in set $\beta_2$ have any effect on the response value.

The goal of a variable selection procedure is to have the significant regressor
variables included in set $\beta_1$ with high probability, while simultaneously achieving a high
probability that the insignificant variables are contained in set $\beta_2$. A regression model
building strategy is an iterative process that involves selection of an active subset of the $p$
parameters followed by model diagnostics to assess the fit. The objective is to find the
best subset of the $p$ parameters to include in the model that leads to good prediction
capability yet minimizes the variance of prediction. The former objective would suggest
including all $p$ variables while the latter suggests using as small a subset as possible
because the variance of prediction always increases as regressor variables are added to a
model. Models with fewer variables are also preferred for simplicity and ease of data
collection.

2.6.1  Variable  Selection  in  Regression 

There are numerous variable selection methods available to the analyst. The
simplest is to retain only the variables whose ratio of coefficient to standard error is
significant. This t-test approach is not reliable as dimension increases and particularly
when dependencies between regressor variables exist. A common alternative is the class
of computer-intensive variable selection methods (e.g. forward, backward, stepwise, and
best subsets regression). The selection criteria are often based on F-tests (F-to-enter and
F-to-leave) or Mallows’s (1973) $C_p$ criterion,
$C_p = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / \hat{\sigma}^2 - n + 2p$, where $\hat{y}_i$ is
the predicted value and $\hat{\sigma}^2$ is typically the MSE from the full model. Unfortunately,
Miller (1990) demonstrates that the F tests and Mallows’s $C_p$ criterion are poor for model
selection as are the $R^2$ and adjusted $R^2$ measures. Breiman (1995) states that the
preferred measure of performance for variable selection in regression is prediction error.

Resampling  methods  are  currently  recommended  to  calculate  a  measure  of 
prediction  error  for  variable  selection.  The  two  most  common  resampling  methods  are 
cross-validation  and  bootstrapping.  Cross-validation  procedures  partition  the  data  into 
two  disjoint  sets.  The  model  is  fit  with  one  set  (the  training  set)  and  subsequently  used  to 
predict  the  responses  for  the  observations  in  the  second  set  (assessment  set).  Bootstrap 
procedures  form  many  samples  from  the  original  data  by  resampling  with  replacement. 
Details  of  the  methods  and  their  application  to  the  variable  selection  problem  in 
regression  are  outlined  below. 


2.6.2  Cross-Validation  Procedures 

An intuitively appealing method to calculate a predicted response value is to use
the parameter estimates from the fit obtained by omitting the observation. This predicted
response value is denoted by $\hat{y}_{(i)}$ and $n^{-1} \sum_{i=1}^{n} (y_i - \hat{y}_{(i)})^2$ is known as the leave-one-
out cross-validation estimate of average prediction error for a model. Apart from the $n^{-1}$
term, this quantity is the PRESS statistic in least squares. For OLS, the PRESS statistic
is calculated as

$$\sum_{i=1}^{n} \left(\frac{e_i}{1 - h_{ii}}\right)^2$$

where $h_{ii} = x_i'(X'X)^{-1}x_i$. Note that PRESS does not require $n$
separate fits while other regression estimators (e.g. robust) do require all $n$ fits for the
leave-one-out cross-validation estimate of prediction error. Shao (1993) proves with
asymptotic results and simulations that the model with the minimum PRESS statistic or
leave-one-out cross-validation estimate of prediction error is often overfit. He
recommends using K-fold cross-validation that leaves a subset of observations out.
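
The shortcut that lets PRESS avoid $n$ separate fits can be checked numerically. The sketch below computes PRESS once from the hat diagonals and once by explicitly refitting OLS with each observation omitted. The data and function names are illustrative assumptions.

```python
import numpy as np


def press_hat(X, y):
    """PRESS from a single OLS fit: sum of (e_i / (1 - h_ii))^2."""
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix X (X'X)^(-1) X'
    e = y - H @ y
    return float(np.sum((e / (1.0 - np.diag(H))) ** 2))


def press_loo(X, y):
    """PRESS by brute force: refit OLS n times, each time omitting one observation."""
    n = len(y)
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
        total += (y[i] - X[i] @ beta) ** 2
    return float(total)


# Illustrative data with two regressors plus an intercept.
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=30)

press_one_fit = press_hat(X, y)
press_n_fits = press_loo(X, y)
```

The two values agree to rounding error for OLS. For a robust estimator no such identity is available in general, so the loop in `press_loo` would have to be run with the robust fit in place of `lstsq`.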

Quenouille (1949) explored the idea of leaving two observations out of the
training set and Stone (1974) extended the method to more than two. In K-fold cross-
validation, the training set omits approximately $n/K$ observations
rather than a single observation like PRESS. To predict the values for the assessment
set, $S_{k,a}$, all observations apart from those in set $k$ are the training set, $S_{k,t}$, and are used to
estimate the model parameters. The K-fold cross-validation average prediction error is

$$\hat{\Delta}_{CV,K} = n^{-1} \sum_{i=1}^{n} (y_i - \hat{y}_{(k,i)})^2$$

where $\hat{y}_{(k,i)}$ is the predicted response for observation $i$
belonging in assessment set $S_{k,a}$.

One approach to the K-fold cross-validation estimate of prediction error is to
randomly select the $n/K$ observations to form the assessment set. This process is repeated
numerous times and the prediction errors are averaged. Breiman et al. (1984) propose a
less computationally intense scheme that randomly partitions the data into $K$ different
disjoint sets. Davison and Hinkley (1997) recommend $K = \min(n^{1/2}, 10)$ in practice.
This procedure decreases the variance of prediction error over that of the leave-one-out
cross-validation estimate but at the expense of increased bias. Surprisingly, Shao (1993)
demonstrates that the smaller the training set (smaller value of $K$), the better the K-fold
estimate is for model selection.

To reduce the bias, Burman (1990) recommends the adjusted K-fold cross-
validation estimate of prediction error as

$$\hat{\Delta}_{ACV,K} = \hat{\Delta}_{CV,K} + \hat{\Delta}_{App} - \sum_{k=1}^{K} p_k \left( n^{-1} \sum_{i=1}^{n} (y_i - \hat{y}_{(k,i)})^2 \right)$$

where $p_k$ is the ratio of observations
in assessment set $k$ to the total $n$, $\hat{\Delta}_{App}$ is the apparent error from the fit
to all $n$ observations, and $\hat{y}_{(k,i)}$ is the predicted response for the $i$th observation
from the fit with training set $S_{k,t}$. The Breiman and Spector (1992) simulations
demonstrate that the performance of the adjusted cross-validation prediction error
estimate is slightly worse than the standard K-fold cross-validation prediction error for
least squares variable selection. Shao (1993) shows that both the leave-one-out and K-
fold cross-validation procedures have a negligible probability of selecting an
underspecified model. The challenge is avoiding an overfit model.
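
The K-fold estimate with a random partition into $K$ disjoint sets, together with Burman's adjustment, can be sketched as below. This is an illustration under assumptions: OLS is used as the fitting method, the adjustment follows the usual statement of Burman's formula (cross-validation error plus apparent error minus the weighted full-sample errors of the $K$ training fits), and the function names and data are hypothetical.

```python
import numpy as np


def ols_fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]


def kfold_prediction_error(X, y, K, seed=0):
    """Return (K-fold CV error, Burman-adjusted CV error) for an OLS fit."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)   # K disjoint assessment sets
    cv_sum, weighted_full = 0.0, 0.0
    for assess in folds:
        train = np.setdiff1d(np.arange(n), assess)
        beta_k = ols_fit(X[train], y[train])
        sq_err_all = (y - X @ beta_k) ** 2           # errors on all n observations
        cv_sum += sq_err_all[assess].sum()           # assessment-set errors only
        weighted_full += (len(assess) / n) * sq_err_all.mean()   # p_k times full-sample error
    cv = cv_sum / n
    apparent = np.mean((y - X @ ols_fit(X, y)) ** 2)
    return cv, cv + apparent - weighted_full


# Illustrative data; K follows the min(sqrt(n), 10) recommendation.
rng = np.random.default_rng(5)
n = 36
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
K = int(min(np.sqrt(n), 10))

cv_k, acv_k = kfold_prediction_error(X, y, K)
```

Setting `K = n` reduces the first value to PRESS divided by $n$, which gives a convenient check of the implementation. For variable selection the function would be called once per candidate subset of columns of `X`.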


2.6.3  Bootstrap  Procedures 

Bootstrap  estimators  in  regression  have  received  considerable  attention  in  the 
literature  since  their  introduction  by  Efron  (1979).  Wu  (1986)  provides  the  theoretical 
results for bootstrap methods applied to regression. Hall (1989) proves that inference in
regression, such as confidence intervals, based on the bootstrap estimate is more
accurate than standard inference procedures even if the error is Gaussian.

The  fundamental  element  of  a  bootstrap  procedure  is  the  bootstrap  sample.  For 
bootstrapping  pairs  in  regression  (Efron,  1982),  the  sample  is  formed  by  randomly 
sampling  with  replacement  n  times  both  a  response  and  its  associated  vector  of  regressor 
variable values from the original sample. The bootstrap sample may contain an
observation from the original sample once, multiple times or not at all. In fact, the
probability that an observation is included in a bootstrap sample of size $n$ is approximately $1 - e^{-1} \approx 0.632$
(Efron and Tibshirani, 1997). A regression model is then fit to the bootstrap
sample to obtain the bootstrap parameter estimates $\hat{\beta}^*$. A large number of bootstrap
samples ($B > 100$) are constructed from the original sample for model inference.
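
A short sketch of bootstrapping pairs may make the resampling step concrete. It assumes an ordinary least squares fit and simulated data; the sample sizes, seed, and variable names are illustrative choices rather than settings from the source.

```python
import numpy as np

rng = np.random.default_rng(1)
n, B = 50, 200
X = np.column_stack([np.ones(n), rng.normal(7.5, 4.0, size=n)])
y = X @ np.array([0.0, 5.0]) + rng.normal(size=n)

boot_betas = np.empty((B, X.shape[1]))
included = np.zeros(n)
for b in range(B):
    idx = rng.integers(0, n, size=n)       # n row indices drawn with replacement
    # Each index keeps the response and its regressor row together (a "pair")
    boot_betas[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    included[np.unique(idx)] += 1          # which observations appeared at least once

inclusion_rate = included.mean() / B       # empirical inclusion probability
exact = 1 - (1 - 1 / n) ** n               # finite-n value that tends to 1 - exp(-1)
```

The empirical inclusion rate should sit close to the finite-sample value `exact`, which itself approaches 0.632 as n grows.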

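For the variable selection problem, the estimate of average prediction error for the
$b$th bootstrap sample is $\hat{\Delta}_b = n^{-1} \sum_{i=1}^{n} (y_i - x_i'\hat{\beta}^*_b)^2$ where $(y_i, x_i)$ are from the original sample.
Efron (1983) provides the unbiased estimator of prediction error as

$$\hat{\Delta}_{b,unbiased} = n^{-1} \sum_{i=1}^{n} (y_i - x_i'\hat{\beta}^*_b)^2 - n^{-1} \sum_{i=1}^{n} (y^*_i - x^{*\prime}_i \hat{\beta}^*_b)^2 + n^{-1} \sum_{i=1}^{n} (y_i - x_i'\hat{\beta})^2$$

where $y^*_i$ and $x^*_i$ are the response and the vector of regressor values for the $i$th observation in the bootstrap sample. The overall
bootstrap estimate of average prediction error is simply $\hat{\Delta}_B = B^{-1} \sum_{b=1}^{B} \hat{\Delta}_{b,unbiased}$. Shao
(1996) shows that selecting the model with the minimum $\hat{\Delta}_B$ is inconsistent.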

Inconsistency implies that the probability the true model has the minimum bootstrap
average prediction error does not equal 1.0 as $n$ approaches infinity. Shao corrects this
inconsistency for bootstrapping pairs by using substantially fewer than $n$ observations to
construct the bootstrap samples. This procedure does not use the bias-corrected estimate
of prediction error. Breiman (1996), motivated by increasing the 0.632 probability that
an observation is selected in a bootstrap sample, notes that using bootstrap samples of
size $2n$ has little effect on the results for least squares variable selection.
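
As an illustration of a bias-corrected bootstrap prediction error, the sketch below adds an averaged optimism term to the apparent error. It assumes squared-error loss, bootstrapping pairs, and a least squares fit; this is one common form of the correction and is offered as an assumption, not as a transcription of any cited author's code. All names are hypothetical.

```python
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def mse(X, y, beta):
    return np.mean((y - X @ beta) ** 2)

def bootstrap_prediction_error(X, y, B, rng):
    """Return (apparent error, optimism-corrected bootstrap prediction error)."""
    n = len(y)
    apparent = mse(X, y, ols(X, y))
    optimism = 0.0
    for _ in range(B):
        idx = rng.integers(0, n, size=n)            # bootstrap the (y, x) pairs
        beta_b = ols(X[idx], y[idx])
        # Error on the original sample minus error on the bootstrap sample itself
        optimism += mse(X, y, beta_b) - mse(X[idx], y[idx], beta_b)
    return apparent, apparent + optimism / B

# Illustrative data (assumed)
rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(7.5, 4.0, size=(n, 2))])
y = X @ np.array([0.0, 5.0, 5.0]) + rng.normal(size=n)
apparent, corrected = bootstrap_prediction_error(X, y, B=200, rng=rng)
```

The corrected value is normally larger than the apparent error, reflecting the optimism of judging a model on the data used to fit it.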

2.6.4  Other  Modifications  to  Resampling  Methods  for  Variable  Selection

Breiman and Spector (1992) explore the use of cost admissibility (penalty for
adding  variables)  with  bootstrap  and  cross-validation  prediction  error  for  variable 
selection.  Their  empirical  results  indicate  that  this  modification  is  only  slightly  beneficial 
to  the  variable  selection  process.  This  is  an  important  result  because  most  resampling 
estimates  of  prediction  error  do  not  account  for  the  number  of  variables  in  the  model. 

Breiman (1992) recommends the little bootstrap estimate of prediction error for
variable selection in linear models. The prediction error for a $k$ variable model using this
approach is $\hat{\Delta}_{App}(k) + 2B_t(k)$. The little bootstrap error, $B_t(k)$, is the resubstitution error
from the model selected using $y^* = y + \varepsilon$ where $\varepsilon$ is a vector of variates from NID$(0, t^2\sigma^2)$
with $0.6 < t < 0.8$. The $MS_E$ for the full model is used as an estimate of $\sigma^2$.
Breiman shows that the little bootstrap is unbiased and superior to $C_p$, F-to-enter, and
F-to-leave for variable selection for fixed designs.

Breiman (1996) suggests bagging (bootstrap aggregating) regressor variables.
For each of the B samples formed by bootstrapping pairs, perform a forward selection to
obtain a 1 variable model, 2 variable model, ..., k variable model. The n × k matrices of
predicted  values  from  these  k  models  are  averaged  across  the  B  bootstrap  samples.  The 
model  with  the  lowest  average  prediction  error  is  selected.  Limited  simulation  results 
indicate  that  this  procedure  performs  better  than  standard  forward  selection.  It  is  unclear 
how  to  proceed  if  the  same  variables  are  not  consistently  selected  in  the  B  samples  for  a 
given  dimension. 


Davison  and  Hinkley  (1997)  describe  a  hybrid  estimate  of  bootstrap  prediction 
error for variable selection adapted from Efron and Tibshirani (1997). The hybrid
estimate of prediction error weights the apparent error and the bootstrap cross-validation
error  calculated  from  the  predicted  values  of  those  observations  not  included  in  the 
bootstrap  sample.  The  authors’  empirical  evidence  suggests  this  procedure  is  superior, 
although  no  results  are  published. 

2.6.5  Variable  Selection  with  Robust  Regression  Estimators 

Although  numerous  estimators  have  been  proposed  in  the  last  25  years,  there  are 
significantly  fewer  results  in  the  literature  that  explore  variable  selection  procedures  in 
the  robust  regression  model.  Most  robust  regression  variable  selection  methods  are  based 
on  robust  versions  of  the  general  linear  test  that  use  the  asymptotic  covariance  matrix 
(Hampel et al., 1986). Markatou and He (1994) and Heritier and Ronchetti (1994) extend
the Wald (similar to t-tests) and drop-in-dispersion tests (similar to F-tests) to GM and
compound  estimators.  Field  (1997)  and  Field  and  Welsh  (1998)  propose  saddlepoint 
approximations  of  tail  area  probabilities  for  robust  regression  hypothesis  testing  as 
improvements  to  the  asymptotic  approach.  The  results  are  mixed  and  they  recommend 
further  testing  in  finite  samples.  Ronchetti  and  Staudte  (1994)  propose  a  robust  version 
of  Mallows’s  Cp.  This  method  multiplies  the  squared  residuals  by  the  final  weights  from 
a  robust  fit  to  compute  the  residual  sum  of  squares.  Two  additional  quantities  are  also 
added to the residual sum of squares that are a function of the number of parameters and
the selected robust estimator. The robust $C_p$ appears to work satisfactorily for their three
examples, but no simulation results are reported.

The Wald test is currently preferred (Heritier, 1997) because of its asymptotic chi-square
distribution and the relative ease of calculating the asymptotic covariance matrix.
Wilcox (1997) experiments (results not reported) with the Wald test using the M-estimator
and the Coakley and Hettmansperger (1993) compound estimator. He found
for  both  estimators,  even  with  normal  and  homoscedastic  error  terms  and  n  =  100,  poor 
control  over  Type  I  error.  All  authors  conclude  that  it  is  important  to  do  further  testing 
and  evaluation  to  understand  the  strengths  and  weaknesses  of  the  methods  in  finite 
samples. 

A  common  use  of  resampling  methods  in  robust  regression  is  construction  of 
confidence  intervals  and  prediction  intervals  with  the  bootstrap  (Efron  and  Tibshirani, 
1993; Davison and Hinkley, 1997; Wilcox, 1994, 1996a, 1996b, 1997). Mammen (1993)
shows  the  consistency  of  the  bootstrap  for  linear  tests  with  the  M  estimator. 

Wilcox  (1997,  1998)  presents  an  interesting  approach  to  the  variable  selection 
problem  in  robust  regression  using  a  bootstrap  resampling  scheme.  He  uses  a  percentile 
bootstrap  approach  to  find  critical  values  for  the  joint  confidence  region  of  the 
Mahalanobis  distance  for  the  model  parameters.  The  steps  of  the  algorithm  are: 

1. Obtain $B$ bootstrap estimates of $\beta$ by bootstrapping pairs.

2. Estimate the covariance matrix $V$ using all $B$ bootstrap estimates of $\beta$.

3. Find the Mahalanobis distance of $(\hat{\beta}^* - \hat{\beta})$ using $V^{-1}$ for each bootstrap
sample where $\hat{\beta}^*$ is the bootstrap estimate of the model parameters and $\hat{\beta}$ is the
vector of parameter estimates from the original data.

4. Sort the Mahalanobis distances and call the $(1-\alpha)B$th ordered distance the
critical value.

5. Find the test statistic by the Mahalanobis distance using $V^{-1}$ of $(\hat{\beta} - c)$ where $c$
is a vector of constants often selected as 0 to test for significance.
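
The five steps above can be sketched as follows. This is an illustrative outline only: the `fit` argument stands for whichever regression estimator is being tested, and the least squares default used here is a placeholder assumption chosen to keep the example self-contained. The number of bootstrap samples, the use of squared distances on both sides of the comparison, and all function names are likewise assumptions.

```python
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def bootstrap_mahalanobis_test(X, y, c, fit=ols, B=500, alpha=0.05, rng=None):
    """Return (test statistic, critical value, reject flag) on squared distances."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    beta_hat = fit(X, y)
    boot = np.empty((B, len(beta_hat)))
    for b in range(B):                                   # step 1: bootstrap pairs
        idx = rng.integers(0, n, size=n)
        boot[b] = fit(X[idx], y[idx])
    V_inv = np.linalg.inv(np.cov(boot, rowvar=False))    # step 2: covariance of estimates
    diff = boot - beta_hat
    d2 = np.einsum("bi,ij,bj->b", diff, V_inv, diff)     # step 3: distance per sample
    k = int(np.ceil((1 - alpha) * B))                    # step 4: (1 - alpha)B-th ordered value
    critical = np.sort(d2)[k - 1]
    stat = (beta_hat - c) @ V_inv @ (beta_hat - c)       # step 5: distance of estimate from c
    return stat, critical, stat > critical

# Illustrative data (assumed): the true slope is 5, so c = 0 should be rejected
rng = np.random.default_rng(3)
n = 60
X = np.column_stack([np.ones(n), rng.normal(7.5, 4.0, size=n)])
y = X @ np.array([0.0, 5.0]) + rng.normal(size=n)
stat, critical, reject = bootstrap_mahalanobis_test(X, y, c=np.zeros(2), rng=rng)
```

Passing a robust estimator as `fit` changes nothing else in the procedure, which is the main reason for writing the estimator as an argument.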

Wilcox  (1998)  states  there  is  room  for  improvement  with  this  method  because  the 
probability  of  a  Type  I  error  can  be  substantially  less  than  nominal  levels  in  many 
circumstances.  He  states  that  this  approach  does  not  work  well  with  least  squares; 
correction  factors  through  simulation  are  required  to  achieve  the  correct  coverage 
probabilities. 

Davison  and  Hinkley  (1997)  provide  a  brief  discussion  of  resampling  methods  in 
robust  regression.  Their  guidance  on  resampling  methods  for  variable  selection  in  robust 
regression  focuses  on  two  main  points:  1)  remove  gross  outliers  from  analysis  because 
too  many  outliers  could  appear  in  the  resampled  data  leading  to  inefficiency  and 
breakdown  and  2)  most  of  the  prediction  error  methods  for  least  squares  should  apply  to 
robust  regression.  They  recommend  that  gross  outliers  be  removed  by  large  residuals 
from  an  LTS  fit. 

2.7  Literature  Review  Summary 

This  chapter  has  reviewed  the  relevant  published  results  to  the  research 
objectives.  Clearly,  there  are  numerous  options  available  for  the  multiple  outlier 
detection  problem  with  few  comparable  results  available  between  methods.  A 
comprehensive  performance  study  is  missing.  Also,  several  options  exist  for  the 
selection of components in multi-staged GM and compound estimators. A critical
evaluation of these components could lead to improved performance. Lastly, the variable
selection problem has not been fully explored for multi-staged GM and compound
estimators.  The  usefulness  of  variable  selection  resampling  methods  has  not  been 
thoroughly  investigated. 


Chapter  3 

A  Comparative  Analysis  of  Multiple  Outlier  Detection  Procedures 


3.1  Introduction 

There  has  been  considerable  interest  in  recent  years  in  the  detection  and 
accommodation  of  multiple  outliers  in  statistical  modeling.  This  chapter  uses  Monte  Carlo 
simulation  to  evaluate  numerous  recently  published  outlier  techniques  in  the  linear 
regression model. Kianifard and Swallow (1990) report a similar smaller study using a
few  earlier  techniques.  Other  comparative  analyses  typically  appear  in  journal  articles 
where  the  authors  propose  a  new  methodology;  however,  these  studies  are  often  limited  in 
scope  and  breadth  of  techniques.  Our  approach  tests  the  latest  and  most  respected 
multiple  outlier  detection  procedures  across  a  number  of  realistic  and  challenging 
regression  scenarios. 

In  general,  Barnett  and  Lewis  (1994)  define  outliers  as  observations  that  appear 
inconsistent  with  the  remainder  of  the  data  set.  For  this  paper,  we  wish  to  identify  outliers 
in  linear  regression  modeling.  Specifically,  we  are  concerned  with  observations  that  differ 
from the regression surface defined by the bulk of the data. It is important to identify
these types of outliers in regression modeling because the observations, when undetected,
can lead to erroneous parameter estimates and inferences. Additionally, these outliers may
be of interest themselves to provide insight into process behavior at certain operating
conditions.

If only a single or few outliers exist, many standard least squares regression
diagnostic  quantities  and  plots  will  reliably  identify  these  observations.  However,  these 
diagnostics have been shown to fail in the presence of multiple outliers, particularly if the
observations are clustered in an outlying cloud. The measures may fail to identify
the outliers (masking), identify clean observations as outliers (swamping), or
both mask and swamp observations. To overcome the limitations of the standard least
squares  diagnostics,  numerous  multiple  outlier  detection  techniques  have  been  proposed  to 
identify  the  outlying  subset  of  observations. 

The  outlying  observations  can  be  remote  in  the  levels  of  the  regressor  or 
explanatory variables (exterior X-space observations). These are considered high-leverage
points because they are influential and pull the regression surface toward them. We refer
to cases that are not unusual in X-space as interior X-space observations. Observations
can  also  be  outlying  in  the  response  variable  (Y-space)  because  of  distant  values  from  the 
responses  of  the  clean  cases.  Further  classification  of  outliers  is  possible  with  respect  to 
the regression model. If the observations do not conform to the regression surface defined
by the bulk of the data, then these cases are known as regression or residual outliers. We
are concerned with two main outlier configurations likely to be encountered in practice: 1)
observations  that  are  interior  X-space  regression  outliers  and  2)  observations  that  are 
exterior  X-space  regression  outliers.  We  consider  testing  these  scenarios  when  the 
response  variable  for  the  outliers  is  a  Y-space  outlier  and  when  it  is  not.  A  third  important 
outlier scenario occurs when the observations are remote in X-space but the response
values conform to the regression surface. We limit the scope of this chapter by not
including  these  high-leverage  “good  outliers”  in  the  study. 

Section  3.2  briefly  describes  the  multiple  outlier  detection  procedures  used  in  this 
comparative study. Detailed summaries of many of these and other multiple outlier
detection procedures can be found in Chapter 2, Hadi and Simonoff (1993), Barnett and
Lewis (1994), and Sebert (1997). Section 3.3 describes the Monte Carlo simulation
scenarios, factors, factor settings and the measures of performance; Section 3.4 provides
the  simulation  results  and  analysis;  and  Section  3.5  summarizes  the  results  for  each 
procedure  and  provides  recommendations. 

3.2  Multiple  Outlier  Detection  Procedures 

The  multiple  outlier  detection  methods  for  linear  regression  selected  in  this  study 
are  either  those  most  recently  published  or  those  most  frequently  cited  in  the  literature. 

We  do  not  consider  many  of  the  previously-published  methods  that  have  been  tested  and 
proven  to  be  either  ineffective  or  too  restrictive  in  assumptions  (e.g.,  specifying  the  exact 
number  of  outliers).  We  do  not  consider  (but  do  advocate)  the  subjective  evaluation  of 
the  data  from  various  multivariate  plots  to  identify  the  outliers  as  suggested  by  Atkinson 
and  Riani  (1997)  and  Cook  (1998),  among  others. 

It  is  convenient  to  consider  two  broad  classes  of  multiple  outlier  detection 
procedures  as  defined  by  Hadi  and  Simonoff  (1993):  direct  methods  and  indirect  methods. 
The  direct  methods  use  algorithms  to  isolate  outliers  and  the  indirect  methods  use  the 
results from robust regression estimators. The description of both the direct and indirect
procedures below considers the standard linear model $y = X\beta + \varepsilon$ where $y$ is the observed
response vector of dimension $n$, the number of observations; $X$ is the observed $n \times p$
matrix of regressor variables with intercept; and $\varepsilon$ is the column vector of $n$ random errors
assumed to have mean 0 and covariance matrix $\sigma^2 I$.

3.2.1  Direct  Procedures 

Many  of  the  direct  procedures  in  the  literature  are  based  on  either  sequential 
deletion  (backward  search)  of  outlying  observations  or  sequential  addition  (forward 
search)  of  clean  observations.  In  a  backward  search,  the  entire  set  of  observations  is 
initially  considered  and  the  outliers  are  sequentially  removed  by  a  criterion  such  as  the 
largest  absolute  value  of  some  transformed  residual.  The  forward  search  works  similarly. 
A  small  subset  of  the  data  is  selected  as  the  initial  clean  basis  and  clean  observations  are 
sequentially added to this basis. Methods using forward search generally outperform
backward search methods (Simonoff, 1991; Atkinson and Riani, 1997). We consider the
forward search procedures from Hadi and Simonoff (1993, 1997) and Swallow and
Kianifard (1996). We also consider the direct procedure based on the eigenstructure of
the influence matrix from Peña and Yohai (1995) and the clustering algorithm from Sebert
et  al.  (1998).  The  general  steps  of  these  algorithms  and  specific  issues  related  to  this 
research  are  outlined  below.  For  most  of  these  procedures,  the  authors  provide  alternative 
algorithms  and  parameter  settings.  Our  philosophy  is  to  choose  the  best  performing 
options determined from our pilot studies, the authors’ published results or both.

The Hadi and Simonoff (1993) forward search algorithm. This procedure
initially determines a clean basis of $p + 1$ observations from the smallest absolute value of
the adjusted residual from a least squares fit, $a_i = e_i / \sqrt{1 - h_{ii}}$. This basis is iteratively
increased to the initial clean subset of size $v = (n + p + 1)/2$ by using the lowest values in
magnitude of least squares scaled residuals. Next, the absolute values of the studentized
residual (if the observation is in the current basis, M) or the scaled prediction error (if the
observation is not in M) are ordered and the lowest $v + 1$ cases become the new basis M.
The procedure continues to add observations until the $(s + 1)$th ordered residual measure
exceeds $t_{(\alpha/2(s+1),\, s-p)}$ where $s$ is the number of observations in the current subset.
Observations $s + 1$ to $n$ are the outliers. Hadi and Simonoff (1997) improve this algorithm
by using the robust distance measures from the Hadi (1992, 1994) forward selection
algorithm to determine the initial clean subset of size $v$ observations.

The  Swallow  and  Kianifard  (1996)  recursive  residual  forward  search  algorithm. 
Swallow  and  Kianifard  suggest  recursive  residuals  standardized  by  a  robust  estimate  of 
scale  as  the  test  statistic  to  classify  multiple  outliers.  The  algorithm  first  orders  the 
magnitudes of the studentized residual values from a least squares fit to form the basis of $p$
clean observations. Recursive residuals, $w_j$, are scaled by the median absolute deviation
from the median (MAD) estimate of scale $\hat{\sigma}$. The $w_j$ are defined as

$$w_j = \frac{y_j - x_j'\hat{\beta}_{j-1}}{\left(1 + x_j'(X_{j-1}'X_{j-1})^{-1}x_j\right)^{1/2}}, \quad j = p + 1, \ldots, n,$$

where $\hat{\beta}_{j-1}$ and $X_{j-1}$ are the least squares estimate and regressor matrix from the first $j - 1$ ordered observations.
The MAD is median$\{|e_i - \text{median}\{e_j\}|\}$ where $e_i$ is the OLS residual, not the studentized
residual.

The test statistic $|w_j / \hat{\sigma}|$ for each observation is compared to a cutoff value to
identify the outliers. A correction factor for the MAD estimate of scale and the cutoff
value come from simulation under the null hypothesis of no outliers.
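
A minimal sketch of the recursive residual computation is given below. It assumes the rows have already been ordered so that the first $p$ form the clean basis; in the demonstration the ordering uses the absolute least squares residuals as a simplification of the studentized ordering described above. The correction factor and cutoff are omitted because they come from simulation. Function names are hypothetical.

```python
import numpy as np

def recursive_residuals(X, y):
    """Recursive residuals w_j, j = p+1..n, for rows already in search order."""
    n, p = X.shape
    w = np.empty(n - p)
    for j in range(p, n):                      # 0-based row j is observation j + 1
        Xp, yp = X[:j], y[:j]                  # all observations before the current one
        beta_prev = np.linalg.lstsq(Xp, yp, rcond=None)[0]
        x_j = X[j]
        leverage = x_j @ np.linalg.solve(Xp.T @ Xp, x_j)
        w[j - p] = (y[j] - x_j @ beta_prev) / np.sqrt(1.0 + leverage)
    return w

def mad(e):
    """Median absolute deviation from the median (no consistency factor)."""
    return np.median(np.abs(e - np.median(e)))

# Illustrative data (assumed)
rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(7.5, 4.0, size=(n, 2))])
y = X @ np.array([0.0, 5.0, 5.0]) + rng.normal(size=n)
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
order = np.argsort(np.abs(e))                  # simplified ordering (assumption)
w = recursive_residuals(X[order], y[order])
scaled = np.abs(w) / mad(e)                    # uncorrected test statistics
```

A useful check on any implementation is that the squared recursive residuals sum to the residual sum of squares of the full least squares fit.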

The Peña and Yohai (1995) influence matrix algorithm. This procedure searches
for breakpoints in the ordered components within the eigenvectors from the influence
matrix, $M = EDHDE/(ps^2)$ where $E$ is the diagonal matrix of least squares residuals, $D$ is
the diagonal matrix with elements $(1 - h_{ii})^{-1}$, $H = X(X'X)^{-1}X'$ is the hat matrix, and $s^2$ is
the usual mean square error estimate of the variance. If the ratio of components exceeds
2.5, then consider all ordered observations after (or before if the components are negative)
this breakpoint as the candidate outliers.
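
The influence matrix itself is simple to form, as the following sketch shows. It covers only the construction of the matrix and its eigen-decomposition, not the breakpoint search, and the function name and simulated data are assumptions made for illustration.

```python
import numpy as np

def influence_matrix(X, y):
    """Influence matrix E D H D E / (p s^2) from a least squares fit."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
    e = y - H @ y                              # least squares residuals
    s2 = e @ e / (n - p)                       # mean square error
    E = np.diag(e)
    D = np.diag(1.0 / (1.0 - np.diag(H)))
    return E @ D @ H @ D @ E / (p * s2)

# Illustrative data (assumed)
rng = np.random.default_rng(5)
n = 30
X = np.column_stack([np.ones(n), rng.normal(7.5, 4.0, size=(n, 2))])
y = X @ np.array([0.0, 5.0, 5.0]) + rng.normal(size=n)
M = influence_matrix(X, y)
eigvals, eigvecs = np.linalg.eigh(M)           # eigenvalues in ascending order
leading = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
```

The diagonal of this matrix reproduces Cook's distance for each observation, and the matrix has rank at most $p$, so only the last $p$ eigenvectors carry information.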

The Sebert et al. (1998) clustering algorithm. This approach clusters the
standardized  predicted  and  standardized  residual  values  from  a  least  squares  fit.  The  crux 
of  the  algorithm  is  finding  the  single  largest  cluster,  or  the  bulk  of  the  data  to  classify  as 
the inliers. Mojena’s stopping rule forms the final clusters (single linkage, Euclidean
distance) by splitting a cluster tree at the average of the n − 1 tree cluster heights (a
measure of cluster separation) plus 1.25 times the standard deviation of the tree cluster
heights.

3.2.2  Indirect  Procedures  from  Robust  Regression  Estimators

Robust  regression  techniques  accommodate  outliers  by  downweighting  or  ignoring 
the  unusual  observations  to  ensure  they  are  not  too  influential  on  the  regression  parameter 
estimates.  It  is  possible  to  detect  aberrant  observations  from  either  the  final  weights 
assigned  to  the  observations  or  by  the  magnitude  of  the  residuals.  Our  research  has  shown 
the  residuals  provide  the  most  reliable  signal  to  detect  multiple  outliers.  The  cutoff  values 
to  declare  an  observation  an  outlier  from  the  residual  value  must  be  computed  by  Monte 
Carlo  simulation  because  the  distribution  of  robust  regression  residuals  is  not  known.  We 
generate 1000 clean data sets from the specified distribution with $k$ regressor variables and
$n$ observations under the null hypothesis of no outliers. The cutoff value is the average of
the two appropriate percentiles (e.g., the 2.5th and 97.5th) of the 1000 × $n$ residuals. All
robust  regression  estimators  in  this  research  have  nearly  symmetric  distributions  for  the 
residuals  of  clean  observations. 
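
The cutoff simulation can be outlined as below. The sketch makes several assumptions that are not stated in this paragraph: it borrows the clean-data settings from Section 3.3 (regressor mean 7.5 and standard deviation 4.0, coefficients 0 and 5.0, unit error variance), treats the regressors as independent, pools raw rather than standardized residuals, and averages the magnitudes of the two percentiles. Least squares is used purely as a stand-in for the robust estimator passed through `fit`.

```python
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def simulate_cutoff(fit, n, k, n_sets=1000, lower=2.5, upper=97.5, rng=None):
    """Average magnitude of two residual percentiles over simulated clean data."""
    rng = np.random.default_rng() if rng is None else rng
    beta = np.concatenate([[0.0], np.full(k, 5.0)])
    pooled = []
    for _ in range(n_sets):
        X = np.column_stack([np.ones(n), rng.normal(7.5, 4.0, size=(n, k))])
        y = X @ beta + rng.normal(size=n)      # clean data: no planted outliers
        pooled.append(y - X @ fit(X, y))
    lo, hi = np.percentile(np.concatenate(pooled), [lower, upper])
    return (abs(lo) + abs(hi)) / 2.0

# Smaller n_sets than in the text, only to keep the example quick
cutoff = simulate_cutoff(ols, n=40, k=2, n_sets=200, rng=np.random.default_rng(6))
```

Replacing `ols` with a robust fitting function yields the corresponding cutoff for that estimator, since the rest of the procedure does not depend on how the fit is obtained.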

The  multiple  outlier  detection  capability  of  several  common  robust  regression 
estimators  is  tested  in  several  scenarios.  The  common  robust  estimators  are  Least  Median 
of Squares (LMS), Least Trimmed (sum of) Squares (LTS), and M-estimators. These
three estimators are available using the internal functions of S-Plus 4.5. We also consider
the MM estimator from Yohai (1987) and its implementation through the ROBETH S-Plus
library (Marazzi, 1993). We use the code from Wilcox (1997) for the standard
bounded influence generalized M-estimator and the compound estimator from Coakley
and Hettmansperger (1993). Also tested is the Simpson and Montgomery (1998)
compound estimator.

LMS estimator. Rousseeuw (1984) introduced the high-breakdown (as much as
50%) LMS estimator. LMS is obtained by minimizing the $h$th ordered squared residual
where $h$ is defined as the sum of the integer portions of $n/2$ and $(p+1)/2$. Note $h$ is not the median of $n$.
LMS fits just over half the data and minimizes the residual for a single observation.
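
To illustrate the LMS criterion, the sketch below approximates the estimator by fitting exact lines through random subsets of $p$ observations and keeping the fit with the smallest $h$th ordered squared residual. Random-subset search is a common approximation and is an assumption here; it is not the genetic algorithm mentioned later in this chapter, and the function name and data are hypothetical.

```python
import numpy as np

def lms_fit(X, y, n_subsets=500, rng=None):
    """Approximate LMS: minimize the h-th ordered squared residual over random exact fits."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    h = n // 2 + (p + 1) // 2                  # integer parts of n/2 and (p+1)/2
    best_beta, best_crit = None, np.inf
    for _ in range(n_subsets):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])   # exact fit through p points
        except np.linalg.LinAlgError:
            continue                                 # skip singular subsets
        crit = np.sort((y - X @ beta) ** 2)[h - 1]
        if crit < best_crit:
            best_beta, best_crit = beta, crit
    return best_beta, best_crit

# Noise-free illustration (assumed): 32 points on y = 5x, 8 points shifted upward
rng = np.random.default_rng(7)
n = 40
x = rng.normal(7.5, 4.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 5.0 * x
y[:8] += 20.0
beta_lms, crit = lms_fit(X, y, rng=rng)
```

Because more than $h$ observations lie exactly on the clean line in this example, the criterion is essentially zero there and the shifted cloud has no effect on the fit.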

LTS estimator. Rousseeuw (1984, 1985) proposed the high-breakdown LTS
estimator as an efficient alternative to LMS. The LTS estimator is formed by minimizing
the sum of the $h$ smallest of the $n$ ordered squared residuals. Rousseeuw and Leroy (1987) recommend $h =
n(1-\alpha) + 1$ where $\alpha$ is the trimmed percentage. This estimator is attractive because $\alpha$ can
be selected to prevent some of the poor results (efficiency) that other 50% breakdown
estimators show.

M-estimator. Huber (1973) developed the M-estimator by minimizing a symmetric
function of the residuals over the parameter estimates. These estimators are the maximum
likelihood solution to $\min \sum_{i=1}^{n} \rho(e_i / s)$ where $\rho$ is the residual weighting (influence)
function, and $s$ is the scale estimate to ensure that if the $y$ values are multiplied by a
constant $c$, then the estimated regression coefficients will also be multiplied by $c$. Several
residual weighting functions are possible based on the downweighting philosophy (see
Montgomery and Peck, 1992).
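
A sketch of an M-estimate computed by iteratively reweighted least squares follows. It assumes Huber's weighting function with tuning constant 1.345, a least squares start, and a rescaled MAD as the scale estimate; these are common defaults chosen for illustration and are not necessarily the settings used in this study.

```python
import numpy as np

def huber_m_estimate(X, y, c=1.345, n_iter=50, tol=1e-8):
    """Huber M-estimate via iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # least squares start
    for _ in range(n_iter):
        e = y - X @ beta
        s = np.median(np.abs(e - np.median(e))) / 0.6745  # rescaled MAD
        u = np.abs(e) / max(s, 1e-12)
        w = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))  # Huber weights
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Illustrative data (assumed): five interior X-space points shifted up by 20
rng = np.random.default_rng(8)
n = 50
x = rng.normal(7.5, 4.0, size=n)
x[:5] = 7.5 + rng.uniform(0.0, 0.25, size=5)
X = np.column_stack([np.ones(n), x])
y = 5.0 * x + rng.normal(size=n)
y[:5] += 20.0
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_huber = huber_m_estimate(X, y)
x0 = np.array([1.0, 7.5])
err_ols = abs(x0 @ beta_ols - 37.5)        # error of the fitted value at x = 7.5
err_huber = abs(x0 @ beta_huber - 37.5)
```

With the outliers placed in the interior of X-space, the downweighting pulls the fitted value near the center of the data back toward the clean surface. The same estimator offers no such protection against high-leverage points, which is what motivates the GM and compound estimators described next.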

MM  estimator.  The  MM  estimator  is  a  high-breakdown  and  high-efficiency 
estimator with three stages. The initial estimate is a high-breakdown estimate using an
S-estimate. The second stage computes an M-estimate of the errors’ scale from the initial
S-estimate residuals. The last step is an M-estimate of the regression parameters using a
redescending $\psi$ function that assigns a weight of 0.0 to large residuals.

Standard Generalized M-estimator. This estimator uses iteratively reweighted
least squares to estimate the model parameters taking into account high-leverage points.
The initial estimate is OLS and the estimate of scale is found by scaling the median of the
absolute value of the OLS residuals. The hat diagonals are used as the measure of
leverage. The GM objective function uses Schweppe weights that seek to improve
efficiency by assigning less weight to high-leverage residuals.

Coakley  and  Hettmansperger  estimator.  This  compound  estimator  uses  LTS  as 
the initial estimate and adjusts the estimates with empirically determined weights. The
weights given to the leverage come from the robust distances using the minimum volume
ellipsoid (MVE) estimator. Other components include a Schweppe-type GM objective
function, an estimate of scale from the scaled median of the LTS residuals, the Huber $\psi$
function  and  a  one-step  Newton-Raphson  convergence  approach. 

Simpson and Montgomery estimator. This compound estimator uses an S-estimate
for the initial estimate and also an S-estimate of scale. The scaled Krasker-Welsch weights
from the M-estimates of covariance provide the measures of leverage. Other components
include a Schweppe-type GM objective function, Tukey bi-weight $\psi$ function and a
one-step reweighted least squares convergence approach.

A  related  approach  to  the  indirect  methods  from  the  robust  regression  estimators  is 
the Rousseeuw and van Zomeren (1990) multiple outlier detection procedure. In the
original proposal, observations are classified as outliers if either the LMS residual value
exceeds 2.5 or if the Mahalanobis distance measure using the MVE estimates of the mean
and covariance matrix exceeds a percentile from the chi-square distribution with $k$ degrees
of  freedom.  The  MVE  estimate  of  the  mean  is  the  centroid  of  the  smallest  ellipse  covering 
at  least  half  of  the  observations  and  the  estimate  of  the  covariance  matrix  is  determined 
from  these  cases  along  with  a  correction  factor  for  consistency  at  multivariate  normal 
distributions.  Rousseeuw  and  van  Zomeren  (1991)  recommend  using  simulated  cutoff 
values  as  an  update  to  the  procedure  to  guard  against  swamping  problems. 

There  are  several  published  results  that  criticize  this  method  for  identifying  too 
many  outliers.  None  have  used  the  improved  genetic  algorithms  to  compute  the  MVE  and 
LMS  estimates.  These  algorithms  are  computationally  and  statistically  more  efficient 
because  more  “clean”  observations  are  used  than  in  previous  algorithms. 

3.3  Monte  Carlo  Simulation  Performance  Study  Planning 

We  use  Monte  Carlo  simulation  to  test  the  performance  of  the  multiple  outlier 
detection  procedures  across  a  wide  range  of  scenarios.  The  simulations  generate  a  fixed 
percentage  of  clean  observations  and  plant  outliers  at  locations  specified  by  the  scenario 
and  factor  settings.  The  regressor  variable  levels  for  the  clean  observations  are  generated 
from a multivariate normal distribution with a mean of $\mu_x = 7.5$ and standard deviation of
$\sigma_x = 4.0$. The choice of these parameters does not affect the results of the simulations, but
is selected to be consistent with some of the results in the literature. The response for the
clean observation is generated by $y_i = x_i'\beta + \varepsilon_i$ where $\beta$ is the vector of known
regression coefficients arbitrarily selected for the simulations to be 0 for the intercept and
5.0 for each of the $k$ regressor variables and $\varepsilon_i$ is the random error term distributed
$N(0, \sigma_\varepsilon^2)$. We select $\sigma_\varepsilon^2$ to be 1.00. For the planted outliers, the $j$th regressor variable
value for the $i$th observation is $x_{ij} = \bar{x}_j + 4\delta_L + \varepsilon^*$ where $\bar{x}_j$ is the average of the
clean values for the $j$th regressor, $\delta_L$ is the magnitude of the outlying shift distance in
X-space in standard deviation units, $\sigma_x$, and $\varepsilon^*$ is a random variate from a Uniform(0, 0.25).
We use the $\varepsilon^*$ term to separate multiple observations in a cloud to protect against singular
matrices. If the observation is a regression outlier, the response value is calculated by
$y_i = x_i'\beta + \delta_R$ where $\delta_R$ is the magnitude of the outlying distance off the regression plane in
standard deviation units, $\sigma_\varepsilon$.
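
The data generation for a single-cloud scenario can be sketched as follows. The sketch is in Python rather than the S-Plus used for the study, and it assumes independent regressors, one outlying cloud, and a shift applied to every regressor. The function name and argument names are hypothetical.

```python
import numpy as np

def generate_scenario(n, k, frac_out, delta_L, delta_R, rng,
                      mu_x=7.5, sigma_x=4.0, sigma_e=1.0):
    """Clean observations plus one planted cloud of regression outliers."""
    n_out = int(round(frac_out * n))
    n_clean = n - n_out
    beta = np.concatenate([[0.0], np.full(k, 5.0)])      # intercept 0, slopes 5.0
    x_clean = rng.normal(mu_x, sigma_x, size=(n_clean, k))
    y_clean = beta[0] + x_clean @ beta[1:] + rng.normal(0.0, sigma_e, n_clean)
    # Outliers: clean regressor means shifted by delta_L standard deviations,
    # with a small uniform jitter so the cloud is not exactly replicated
    jitter = rng.uniform(0.0, 0.25, size=(n_out, k))
    x_out = x_clean.mean(axis=0) + delta_L * sigma_x + jitter
    # Responses sit delta_R error standard deviations off the true plane
    y_out = beta[0] + x_out @ beta[1:] + delta_R * sigma_e
    X = np.column_stack([np.ones(n), np.vstack([x_clean, x_out])])
    y = np.concatenate([y_clean, y_out])
    is_outlier = np.arange(n) >= n_clean
    return X, y, is_outlier, beta

X, y, is_outlier, beta = generate_scenario(
    n=40, k=2, frac_out=0.10, delta_L=4.0, delta_R=5.0,
    rng=np.random.default_rng(9))
```

Setting `delta_L` to zero gives interior X-space regression outliers, while a positive value produces the exterior X-space configuration.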

Where practical, simulation studies use factorial designs to characterize the effects
of specific factors on the two primary measures of performance: detection capability and
false alarm rate. The false alarm rate is the probability that a clean observation is
swamped and the complement of detection probability is the masking probability. The
factors considered are the dimension of the data, the percentage of outliers, the magnitude
of unusualness in X-space, $\delta_L$, the magnitude of unusualness in residual, $\delta_R$, the number of
multiple point clouds, and the proportion of regressor variables with extreme values.
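
For a single simulated data set, the two performance measures reduce to simple proportions, as in this sketch. The function name is hypothetical and the flags in the example are made up for illustration.

```python
def detection_and_false_alarm(flagged, is_outlier):
    """Return (detection rate, false alarm rate) for one data set."""
    n_out = sum(is_outlier)
    n_clean = len(is_outlier) - n_out
    hits = sum(f and o for f, o in zip(flagged, is_outlier))
    false_alarms = sum(f and not o for f, o in zip(flagged, is_outlier))
    detection = hits / n_out if n_out else float("nan")            # planted outliers found
    false_alarm = false_alarms / n_clean if n_clean else float("nan")  # clean cases swamped
    return detection, false_alarm

# Two planted outliers, one found; three clean cases, one swamped
flagged = [True, False, True, False, False]
planted = [True, True, False, False, False]
detection, false_alarm = detection_and_false_alarm(flagged, planted)
masking = 1.0 - detection
```

Averaging these proportions over the replicated data sets of a scenario gives the estimated detection and false alarm probabilities.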

The factor levels are selected to develop challenging scenarios that discriminate among the procedures. Extensive pilot studies were run to find levels that not only challenge the procedures but also ensure that at least one of the candidates has detection capability for most combinations of the selected factor levels.

The levels for dimension are either $k = 2$ variables with $n = 40$ observations or $k = 6$ variables with $n = 60$ observations. The levels for the percentage of outliers are typically 10% and 20%, although some studies vary these factor settings. The levels for the outlying distances $\delta_L$ and $\delta_R$ are typically between 3 and 5 standard deviation units. The number of clouds is selected as a factor with settings of usually 1 or 2, because the most difficult outlier configuration is a mean shift with a single cloud of observations that are clustered close to one another, yet not replicated (Rocke and Woodruff, 1996). The levels for the number of outlying variables are either all $k$ variables, as commonly seen in the literature, one of the variables, or 3 of 6 variables.

To compare the methods properly and fairly, we set their parameters such that the expected false alarm probability is 5% under the null hypothesis of no outliers. For example, the simulated cutoff value for an indirect robust regression procedure is calculated as the 95th percentile of the absolute value of the residuals from clean data (no planted outliers).
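
The idea of a simulated cutoff can be illustrated as follows. This is a sketch under stated assumptions: the helper name `simulated_cutoff` is hypothetical, a nearest-rank percentile is used, and standard normal draws stand in for the scaled residuals that a robust fit would produce on clean data.

```python
import random

def simulated_cutoff(abs_residuals, level=0.95):
    """Empirical `level` quantile of absolute residuals pooled over clean-data runs."""
    ordered = sorted(abs_residuals)
    # Nearest-rank index of the requested percentile
    idx = min(len(ordered) - 1, int(level * len(ordered)))
    return ordered[idx]

rng = random.Random(0)
# Stand-in for residuals from clean fits (an assumption, not the study's residuals)
clean_abs_resid = [abs(rng.gauss(0.0, 1.0)) for _ in range(20000)]
cutoff = simulated_cutoff(clean_abs_resid)

# By construction, about 5% of clean residuals exceed the cutoff
false_alarm = sum(r > cutoff for r in clean_abs_resid) / len(clean_abs_resid)
```

A case in a contaminated sample would then be flagged when its absolute residual exceeds `cutoff`, which ties the nominal false alarm rate to 5% when no outliers are present.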

The  Monte  Carlo  simulations  are  all  performed  in  S-Plus  (the  simulation  and 
procedure  code  is  shown  in  Appendix  A)  and  are  classified  into  two  main  categories  of 
regression  outliers:  1)  interior  X-space  outliers  and  2)  exterior  X-space  outliers.  Studies 
are  made  within  each  of  these  main  categories  to  best  evaluate  the  procedures  across  a 
wide  range  of  possible  regression  scenarios.  Figure  3.1  displays  the  appropriate  section 
numbers within the chapter for the study results.

Figure 3.1. Organization chart for the Monte Carlo simulation studies.


3.4  Performance  Study  Results 

Each procedure's performance is evaluated on its ability to detect the planted outliers and avoid false alarms. Both the detection capability and the false alarm rate (shown in parentheses in the tables) are reported for 500 replications. Common random numbers ensure that each procedure evaluates the same 500 sets of data. Section 3.4.1 describes the experiment designs, results and performance summaries for interior X-space regression outliers. The high-leverage (exterior X-space) regression outlier studies are in Section 3.4.2.
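
One simple way to obtain common random numbers is to generate the replicated data sets once from a fixed seed and hand the identical sets to every procedure. The sketch below assumes this approach; the names `make_replications` and `evaluate`, the seed value, and the clean-data-only generator are illustrative and are not taken from the study's code.

```python
import random

def make_replications(n_reps, n, k, seed=12345):
    """Generate the replicated data sets once so every procedure sees the same ones."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_reps):
        X = [[rng.gauss(7.5, 4.0) for _ in range(k)] for _ in range(n)]
        y = [5.0 * sum(row) + rng.gauss(0.0, 1.0) for row in X]
        reps.append((X, y))
    return reps

def evaluate(procedures, reps):
    """Apply each procedure (a function of X, y returning flags) to identical data."""
    return {name: [proc(X, y) for X, y in reps] for name, proc in procedures.items()}
```

Because the data are shared, differences in detection or false alarm rates between procedures reflect the procedures themselves rather than sampling differences between their data sets.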

3.4.1 Interior X-space Regression Outliers

This set of experiments evaluates the ability of the methods to identify regression outliers when none of the regressor variable values are unusual in X-space. That is, there are no high-leverage points intentionally planted in the samples. The response values for the interior X-space outlying observations are offset a distance $\delta_R$ from the regression plane obtained from the clean cases. There are three studies in this section based on the configuration of the outliers. In the first study, the multiple outliers are randomly scattered in the interior of X-space. The second study considers multiple point clouds or clusters of outliers that are located near the centroid of X-space. The third study considers multiple point clouds randomly placed (different for each replication) in the interior of X-space. The measures of performance are the probability of detection and the probability that a clean observation is incorrectly classified as an outlier. The average value of these probabilities and the active effects from the analysis of variance are displayed in the last rows of the tables to provide summary information on the techniques.

The direct procedures evaluated in these studies and the accompanying abbreviations for the tables of results are: 1) the Sebert et al. clustering algorithm (SM&R), 2) the Swallow and Kianifard (S&K) recursive residual algorithm, 3) the Pena and Yohai (P&Y) influence matrix algorithm, and 4) the Hadi and Simonoff sequential point addition algorithm. Both the original Hadi and Simonoff algorithm (HS93) and the updated version (HS97) are considered. The selected indirect procedures that incorporate the residuals from regression estimators are OLS, LMS, LTS, M and MM. To limit the scope, residuals from compound estimators and GM-estimators are not considered for these studies because we are not considering high-leverage points.

3.4.1.1  Randomly  Scattered  Regression  Outliers  in  the  Interior  of  X-Space 

This study evaluates performance when the outliers have random levels of the regressor variables with the same distribution as the clean observations but the response values are placed at a specified distance $\delta_R$ off the regression plane. The response for the $i$th clean case is generated by $y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$, where $\boldsymbol{\beta}$ is the vector of known regression coefficients selected for the simulations to be 0 for the intercept and 5.0 for each of the $k$ regressor variables, $\mathbf{x}_i$ is the vector of levels for the $k$ regressor variables distributed multivariate normal with mean 7.5 and standard deviation 4.0, and $\varepsilon_i$ is the random error term distributed $N(0, \sigma_\varepsilon^2)$ with $\sigma_\varepsilon^2$ set to 1.00. The response for the $i$th outlying observation is generated by $y_i = \mathbf{x}_i'\boldsymbol{\beta} + \delta_R$, where $\delta_R$ is the outlying distance off the regression plane in standard deviation units of $\sigma_\varepsilon$. The design in Table 3.1 considers dimension, density of outliers (as a percentage of the sample size), and $\delta_R$ as the effects. The probability of correctly identifying the known outliers and the false alarm probability (in parentheses) are the results from the Monte Carlo simulations.

The detection capability of the OLS, M and MM regression estimators stands out in the probabilities reported in Table 3.1. These indirect methods are the only ones with any power at a magnitude of $\delta_R = 3\sigma_\varepsilon$, and they have nearly perfect detection capability at the larger magnitudes studied. Although it has excellent detection capability, the OLS method is unsatisfactory because it swamps clean observations, as indicated by the high false alarm rate. This is attributed to a degradation in parameter estimates (that worsens as a function of $\delta_R$) such that the clean data are no longer fit well. The M-estimator has some difficulty with false alarms in high-density scenarios, as expected, and the MM procedure has a slightly lower, although still high, false alarm rate. The high-breakdown methods of LMS and LTS are preferred for outlying magnitudes at $4\sigma_\varepsilon$ and beyond because of the competitive detection probabilities and low false alarm rates. The LMS estimator is slightly preferred over the LTS in low-density scenarios and the opposite is true for the high-density scenarios.

Table 3.1. Design matrix with detection capability and false alarm rates (in parentheses) for regression outliers generated from random levels of the regressor variables from the interior of X-space.

For the direct methods, the Pena and Yohai method is significantly outperformed by all other techniques at these outlying distances. Further simulation demonstrates the algorithm does have much better detection capability if $\delta_R$ is greater than approximately $7\sigma_\varepsilon$. The Sebert et al. clustering procedure has decent detection capability, but suffers from a large false alarm probability in many scenarios. The false alarm rate for both Hadi and Simonoff versions is abnormally low and the detection capability is also low at $4\sigma_\varepsilon$ and below. This presented an opportunity to increase detection capability by decreasing the cutoff value from the $t$ distribution based on a Bonferroni approach. The value of $\alpha$ is increased from 0.05 to 0.20. The results in Table 3.1 do indicate a greater detection capability is possible without severe impact to false alarm probabilities. We note that the original Hadi and Simonoff version from 1993 has nearly identical performance to the improved version from 1997 for $\alpha = 0.05$. The 1993 version moderately outperforms the improved version for $\alpha = 0.20$.

3.4.1.2  Regression  Outliers  in  Multiple  Point  Clouds  at  the  Centroid  of  X-Space 
This study evaluates the performance of the procedures when there are multiple observations forming one or two clusters at the centroid of X-space that are off the regression plane. As usual, the response value for the $i$th clean observation is generated by $y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$, where $\boldsymbol{\beta}$ is the vector of known regression coefficients selected for the simulations to be 0 for the intercept and 5.0 for each of the $k$ regressor variables, $\mathbf{x}_i$ is the vector of levels for the $k$ regressor variables distributed multivariate normal with mean 7.5 and standard deviation 4.0, and $\varepsilon_i$ is the random error term distributed $N(0, \sigma_\varepsilon^2)$ with $\sigma_\varepsilon^2$ set to 1.00. The response value for the outlying case is generated by $y_i = \mathbf{x}_i'\boldsymbol{\beta} + \delta_R$, where $\mathbf{x}_i$ is the vector of $k$ regressor variables distributed Uniform(7.375, 7.625) and $\delta_R$ is the outlying distance off the regression plane in standard deviation units. If there are two clouds, the response values for the outliers in the first cloud are generated as above and the second cloud's response values are generated by $y_i = \mathbf{x}_i'\boldsymbol{\beta} - \delta_R$. The four factors for this experiment are dimension, outlier density, outlying distance ($\delta_R$), and the number of clouds. The design and results are displayed in Table 3.2. For this particular study, the levels of $\delta_R$ are close to one another because initial experimentation indicated that none of the procedures had detection capability below $3\sigma_\varepsilon$ and nearly all had virtually perfect detection capability at $5\sigma_\varepsilon$ and beyond.


Table 3.2. Design matrix with detection capability and false alarm rates (in parentheses) for regression outliers in multiple point clouds at the centroid of X-space. A single cloud is placed $\delta_R$ off the regression surface.

The methods are more successful at detecting these outlying observations in clouds at the centroid of X-space compared to similar scenarios with randomly scattered outliers in Section 3.4.1.1. Again, the OLS, M and MM indirect methods are superior in detection capability. OLS has problems with swamping if there is a single cloud for the reasons described in Section 3.4.1.1. However, when there are two clouds, there is no swamping because there is an equal and opposite “pull” on the regression surface from each cloud that leaves the parameter estimates essentially unchanged from those obtained with clean observations only. M and MM have nearly identical detection and false alarm probabilities. Except for the two highlighted scenarios, the Hadi and Simonoff 1997 updated procedure performs as well as or slightly better than the original 1993 version. All other methods have results consistent with Section 3.4.1.1.

3.4.1.3 Regression Outliers in Multiple Point Clouds: Regressor Variables
Randomly Scattered on the Interior of X-Space

The multiple outlier clouds for this section are placed at different locations in X-space for each replication rather than at the centroid. The location of the regressors for outlying observations in a single point cloud is determined by using the median of the first three clean observations for each variable. The regressor variables for the outlying observations then vary as Uniform(0, 0.25) around this median value. Outliers in a second cloud, if applicable, vary around the median value of the last three clean observations in each variable. Recall that each regressor variable for the clean observations is distributed $N(7.5, 4^2)$. We found that using the median of three observations provides adequate coverage of interior X-space and that using more than three observations tends to place the outlying observations too close to the centroid of X-space. The response values are found exactly as in the previous two sections with three levels of $\delta_R$ specified as the outlying magnitude. The factors for this experiment design in Table 3.3 are dimension, contamination, number of clouds and the outlying distance $\delta_R$.
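
The median-of-three placement described above can be sketched as follows. The function names `cloud_center` and `cloud_regressors` are hypothetical, and the jitter is read as an additive Uniform(0, 0.25) offset on each regressor, which is an assumption about the phrase "vary as Uniform(0, 0.25) around this median value".

```python
import random
import statistics

def cloud_center(X_clean, which="first"):
    """Per-variable median of the first (or last) three clean observations."""
    rows = X_clean[:3] if which == "first" else X_clean[-3:]
    k = len(X_clean[0])
    return [statistics.median(row[j] for row in rows) for j in range(k)]

def cloud_regressors(center, m, rng, width=0.25):
    """m outlying cases: cloud center plus Uniform(0, width) jitter per regressor."""
    return [[c + rng.uniform(0.0, width) for c in center] for _ in range(m)]

# Small hand-made example with k = 2 regressors
X = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0], [9.0, 90.0], [8.0, 80.0], [7.0, 70.0]]
first_center = cloud_center(X, "first")   # medians of the first three rows
second_center = cloud_center(X, "last")   # medians of the last three rows
cloud = cloud_regressors(first_center, 4, random.Random(3))
```

Because the clean rows change with every replication, the cloud centers move around the interior of X-space from one replication to the next.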

The results in Table 3.3 indicate the findings are consistent with the first two studies except this scenario is more challenging. Most main effects and many two-factor interactions are significant for detection capability except for the high-breakdown regression estimators. The least squares estimates do not fit the outlying cloud(s) well, as evidenced by the high probability of detection; however, they do chase these observations enough to swamp some clean observations. The M and MM estimators have moderately better detection probabilities than OLS and significantly better false alarm rates, although well above nominal levels in the high-dimension, high-density scenarios. The high-breakdown methods are not affected by high false alarm rates and reliably detect the outliers at $4\sigma_\varepsilon$ and beyond. Sebert et al. is no longer competitive with the other procedures because of a consistently high false alarm problem and decreased power. Pena and Yohai has slightly better performance with the increased leverage for these outlying clouds, although still not competitive with any other procedure. Both Hadi and Simonoff procedures have very low false alarm rates and further testing demonstrates substantial improvement in detection capability is possible if $\alpha$ is increased to as much as 0.30.

3.4.1.4 Supplemental Runs of Higher Outlying Distances and Outlier Density

This study examines the effect of increasing the factor settings for $\delta_R$, the outlying distance off the regression plane, and the contamination or percentage of outliers. The factor level for $\delta_R$ is changed from the usual challenging range of 3 to $5\sigma_\varepsilon$ to a low level of $5\sigma_\varepsilon$ and a high level of $10\sigma_\varepsilon$. The percentage contamination is changed to 15% for the low level and 30% for the high level. The first two scenarios in Table 3.4 are randomly scattered outliers as investigated in Section 3.4.1.1. The next four scenarios have multiple point clouds placed at or near the centroid. The last 8 scenarios form a $2^{4-1}$ fractional factorial with the outliers placed in clouds randomly throughout X-space similar to those of Section 3.4.1.3.

These runs produce some rather different results from the preceding studies. With the contamination set higher, we now see more clearly that the OLS and M estimators break down, as shown by the astronomical false alarm rates. The MM estimator reliably detects the planted outliers (the one exception is the second shaded scenario). The original Hadi and Simonoff procedure is superior to the modified version for these scenarios. The first shaded scenario is different from most because of the failure of the high-breakdown regression estimators and the discrepancy in performance between the two Hadi and Simonoff versions. All procedures fail in the second shaded scenario. In most instances, the LMS, LTS and MM estimators perform the best when accounting for the false alarm rates. We also note the success of the Pena and Yohai procedure for the low-density, high outlying distance scenarios. Surprisingly, the only active effect in this operating region for detection capability from the ½ fraction design is the percentage of outliers; this is true only for the direct procedures.

Table 3.4. Design matrix with detection capability and false alarm rates (in parentheses) for high-magnitude, high-density runs for regression outliers in the interior of X-space.

3.4.2 Exterior X-space Regression Outliers

This section evaluates a method's ability to detect observations outlying in X-space (high-leverage) and also off the regression plane (residual outliers). The same direct procedures are evaluated. We change the indirect methods because of known vulnerabilities of the M and MM estimators in high-leverage situations. The indirect procedures are the bounded-influence generalized M-estimator (GM) and the compound robust regression estimators of Coakley and Hettmansperger (CE C&H) and Simpson and Montgomery (CE S&M). We also test the procedures from Rousseeuw and van Zomeren (1990, 1991) that suggest the MVE robust distances to identify observations remote in X-space and LMS standardized residuals to find regression outliers. The original proposal (R&vZ chi) uses robust distance cutoff values from percentiles of the chi-square distribution ($\chi^2_{k,0.975}$) and a rule-of-thumb cutoff value for the LMS standardized residuals (2.5). Their subsequent recommendation (R&vZ sim) uses simulated cutoff values.

The first study, Section 3.4.2.1, has multiple point clouds at various leverage locations. The regressor variable values are remote in all $k$ regressors for the planted outliers, as is often reported in the literature. Section 3.4.2.2 is similar to the first study but ensures the response values for the outliers, although off the regression surface, are not unusual with respect to the clean responses. That is, the regression outliers are Y-space inliers. We next investigate in Section 3.4.2.3 the effect if the outliers are not unusual in all $k$ regressor variables and the outlying magnitude is increased. The last Section 3.4.2.4 evaluates the performance when there is a remote cloud in X-space of regression outliers and also other regression outliers in the interior of X-space. This last experiment looks at the possibility of the high-leverage, large-magnitude regression outliers masking the low-leverage, smaller-magnitude regression inliers. Throughout all of the studies, cutoff values and other internal parameters are selected for each procedure to ensure the expected false alarm rate is approximately 0.05 under the null hypothesis of no outliers.

3.4.2.1  Regression  Outliers  in  Clouds  that  are  Unusual  in  X-space  for  All  Regressors 
This study evaluates performance for scenarios with high-leverage multiple point clouds that are off the regression plane. The scenarios are similar to those used by Sebert et al., Hadi and Simonoff, and Swallow and Kianifard. The regressor and response values for the clean observations are computed as described in Section 3.4.1. The value of the $j$th regressor variable for the $i$th planted outlier is $x_{ij} = \bar{x}_{j,\text{clean}} + \delta_L + \varepsilon^*_{ij}$, where $\bar{x}_{j,\text{clean}}$ is the average of the clean observations for the $j$th regressor variable, $\delta_L$ is the magnitude of the outlying distance in X-space in standard deviation units, $\sigma_x$, and $\varepsilon^*_{ij}$ is a random variate from a Uniform(0, 0.25) distribution. In this section, all $k$ regressor variable values for the outlying observations are generated as above. A multiple point cloud is placed at the edge of interior X-space when the leverage magnitude is at the low factor setting ($\delta_L = 2\sigma_x$). The cloud is significantly remote in X-space for the high factor setting of leverage magnitude ($\delta_L = 5\sigma_x$). If there are 2 clouds, the second cloud is placed at approximately the same location in X-space but the response value is $2\sigma_\varepsilon$ above that of the first cloud.

As an example, for $k = 2$, leverage magnitude $\delta_L = 2\sigma_x$, and residual magnitude $\delta_R = 5\sigma_\varepsilon$, the regressor variable values for the $i$th outlying observation in either cloud are $x_{i1} = 7.5 + 2(4) + \varepsilon^*_{i1}$ and $x_{i2} = 7.5 + 2(4) + \varepsilon^*_{i2}$. The response value for the $i$th outlying observation in the first cloud is calculated as $y_i = 5x_{i1} + 5x_{i2} + 5$ and the $i$th response value in the second cloud is calculated as $y_i = 5x_{i1} + 5x_{i2} + 7$. The factors considered for this experiment are dimension, outlier density, leverage ($\delta_L$), outlying distance off the regression plane ($\delta_R$) and the number of multiple point clouds. The full factorial $2^5$ design and resulting measures of performance are displayed in Tables 3.5a (single cloud) and 3.5b (two clouds). A much more efficient $2_V^{5-1}$ design was initially run but many interesting factor combinations were missing.

The most notable feature from Tables 3.5a and 3.5b is the lack of detection capability for many of the methods at $\delta_R = 3\sigma_\varepsilon$ (the top half of both tables) now that leverage is added as a factor. From one perspective these methods have not necessarily failed, because in most of these scenarios the outlying clouds do not break down the OLS parameters (note the moderate OLS false alarm rates). However, the practitioner still may want to identify these cases for reasons other than impact to estimation. The Sebert et al. and Rousseeuw and van Zomeren procedures do have detection power in these scenarios.


Table 3.5a. Design matrix with detection and false alarm probabilities for high-leverage (unusual in all $k$ regressors) regression outliers forming a single cloud.

For the direct methods in all scenarios, the Sebert et al. procedure has virtually perfect detection capability and reasonable false alarm rates. This success is attributed to a favorable clustering condition for the outliers from unusually high predicted response values coupled with near-zero standardized least squares residuals because of the leverage. The expected response value for the clean observations is $E(y) = 5k(7.5)$ and the expected response for the outliers is $5k(7.5 + 4\delta_L) + 0.125 + \delta_R$. We investigate the algorithm's performance if the predicted response values are not unusual (Y-space inliers) in the next section, 3.4.2.2. The other direct methods do not fare as well. Both the Hadi and Simonoff and Swallow and Kianifard methods have little detection capability in almost all scenarios because they sequentially add an observation to the clean basis as a function of the smallest OLS residual. Clearly, these high-leverage outliers can have very small OLS residual values and are often masked. We note again the unusually low false alarm probabilities for both Hadi and Simonoff procedures and investigate the possibility of relaxing the cutoff values from the $t$ distribution in Section 3.4.2.3. The Pena and Yohai algorithm does have some moderate detection capability for these high-leverage regression outlier scenarios.

The Rousseeuw and van Zomeren methods successfully detect the outlying clouds in exterior X-space but are troubled by high false alarm rates, particularly for the single cloud scenarios in Table 3.5a. The simulated cutoff values provide slightly less outlier detection capability but significantly lower false alarm rates than the original proposal.

For the indirect methods with regression estimators, the generalized M-estimate has poor detection capability because of the breakdown of the OLS initial estimate and the hat diagonal leverage component. The compound estimators have reasonable performance in the high residual distance scenarios in Table 3.5b, apart from the high-dimension, high-contamination scenarios. The false alarm probability moderately exceeds the nominal 5% rate for both compound estimators in these scenarios. The Simpson and Montgomery estimator slightly outperforms the Coakley-Hettmansperger estimator.

3.4.2.2  Regression  Outliers  in  Clouds  that  are  Unusual  in  X-Space  in  All  k 
Regressors  but  the  Outlier  Responses  are  not  Unusual  in  Y-Space 

This study investigates the effect of changing geometry in X-space such that the outlying cloud will not have an unusual response value with respect to the responses for the clean observations. The values for the $j$th regressor variable for the outliers are now generated as $x_{ij} = \bar{x}_j + 4\delta_L + \varepsilon^*_{ij}$ for $j = 1, 3, 5$ and $x_{ij} = \bar{x}_j - 4\delta_L + \varepsilon^*_{ij}$ for $j = 2, 4, 6$. This scheme effectively equalizes the expected response values for the clean and outlying cases. The four scenarios in Table 3.6 are randomly selected from those in Section 3.4.2.1 with the regressor variable values for the outliers generated as described above.
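
As a quick check of the equalization claim, the following worked expectation is a sketch that assumes a coefficient of 5.0 on every regressor, an even number $k$ of regressors with the signs alternating as described, and $E(\bar{x}_j) = 7.5$. It neglects the small jitter term $\varepsilon^*_{ij}$.

```latex
\[
E(y_{\text{outlier}})
  \approx 5 \sum_{j\ \text{odd}} (7.5 + 4\delta_L)
        + 5 \sum_{j\ \text{even}} (7.5 - 4\delta_L)
        + \delta_R
  = 5k(7.5) + \delta_R .
\]
```

For $k = 6$ this gives $225 + \delta_R$ against $E(y) = 225$ for the clean cases. If the regressors are independent (an assumption here), the clean responses have standard deviation $\sqrt{25 \cdot 16 \cdot k + 1} = 49$ for $k = 6$, so an offset of a few units leaves the outlying responses well inside the range of the clean responses.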

The results in Table 3.6 for nearly all techniques are within a few percentage points of the corresponding results in Section 3.4.2.1 for both detection capability and false alarm probabilities. The 1997 Hadi and Simonoff procedure has significantly lower detection capability than most other procedures for the first scenario and significantly higher detection capability in the last scenario. In contrast to near-perfect performance in the previous experiment, the Sebert et al. procedure fails in these scenarios. Not only is detection capability low, but also the false alarm probability is high.

Table 3.6. Design matrix with detection capability and false alarm probabilities (in parentheses) for high-leverage regression outliers when the response variable is not unusual in Y-space for the outlying observations.

3.4.2.3 Outliers are Unusual in X-space in a Subset of Regressor Variables and
Larger Residual Magnitude ($\delta_R$) Factor Settings

This study investigates the power and false alarm rates for the procedures when the factor settings for residual magnitude are changed from $\delta_R = 3\sigma_\varepsilon$ for the low level and $5\sigma_\varepsilon$ for the high level to $5\sigma_\varepsilon$ and $10\sigma_\varepsilon$, respectively. The number of clouds is set at one because this has proven to be the more challenging configuration for these procedures. The number of unusual regressor variables out of $k$ for the outliers is introduced as a factor with the low level as 1 and the high level as 2 for $k = 2$ and 3 for $k = 6$. We believe this to be a more likely scenario to encounter in practice as opposed to finding cases that are outlying in all $k$ variables. Additionally, the regressor variables alternate in sign as described in Section 3.4.2.2 to guard against unusually large response values for the planted outliers. The experiment design in Table 3.7 is a $2_V^{5-1}$; the two-factor interactions are not aliased with main effects or other two-factor interactions.
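
A resolution V half fraction of a five-factor, two-level design can be constructed as in the sketch below. The generator E = ABCD (defining relation I = ABCDE) is the usual textbook choice and is an assumption here, since the generator actually used for Table 3.7 is not stated. The function name is hypothetical.

```python
from itertools import product

def half_fraction_2_5_1():
    """16-run 2^(5-1) design in coded units (-1, +1), defining relation I = ABCDE."""
    runs = []
    for a, b, c, d in product((-1, 1), repeat=4):
        e = a * b * c * d  # assumed generator E = ABCD
        runs.append((a, b, c, d, e))
    return runs

runs = half_fraction_2_5_1()

# Resolution V check on one example: the contrast for main effect A is
# orthogonal to the contrast for the two-factor interaction BC.
a_vs_bc = sum(r[0] * (r[1] * r[2]) for r in runs)
```

With this generator every main effect is aliased only with a four-factor interaction and every two-factor interaction only with a three-factor interaction, which is the property described above.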

These scenarios are important to detect because of significant swamping from the OLS fit, as evidenced by the high false alarm rates in Table 3.7. The shaded scenarios indicate voids where all procedures fail to detect the outlying cloud in high dimension, high contamination. Noteworthy results for the direct procedures across all scenarios are the general failure of the Swallow and Kianifard recursive residuals procedure, the high false alarm rate and limited detection capability of the Sebert et al. clustering procedure, and the improved performance (although somewhat limited in detection capability in high-density scenarios) of the Pena and Yohai influence matrix procedure. The original Hadi and Simonoff forward selection procedure performs better than the improved version in low-leverage scenarios and the opposite is true for exterior X-space. Relaxing $\alpha$ to 0.20 gains very little in detection capability for both versions of the Hadi and Simonoff procedure but carries the risk of excessive false alarms in high-dimension scenarios.

Table 3.7. Half-fraction design matrix with detection and false alarm probabilities (in parentheses) for large regression outlying distance. The outliers are not necessarily outlying in all regressor variables.

Except for the shaded scenarios, the Rousseeuw and van Zomeren procedures have near-perfect detection capability. Both suffer from high false alarm rates, although the simulated critical value procedure has slightly lower false alarm rates. The robust regression estimators have results consistent with the previous sections: 1) the generalized M-estimator fails in high-leverage scenarios, 2) the Simpson and Montgomery estimator slightly outperforms the Coakley and Hettmansperger estimator, and 3) the compound estimators have difficulty with some exterior X-space scenarios.

3.4.3  Interior  and  Exterior  X-Space  Outliers 

This study evaluates the performance of the procedures when large magnitude outliers in
both $\delta_L$ and $\delta_R$ are present that could mask the lower magnitude outliers.
The first four scenarios have a single cloud with 10% of the observations outlying at
high leverage ($\delta_L = 5\sigma_x$) and significantly off the regression plane
($\delta_R = 10\sigma_\varepsilon$). Another 10% of the observations are randomly
scattered regression outliers at the interior of X-space at a magnitude of
$\delta_R = 4\sigma_\varepsilon$. The last four scenarios place the interior outliers in a
cloud approximately at the centroid of X-space, $\delta_R = 4\sigma_\varepsilon$.

The original Rousseeuw and van Zomeren procedure has the best detection probabilities in
Table 3.8 and also has false alarm rates slightly above the nominal 5% level. The next
best performing methods are the compound estimators, also with moderately high false
alarm probabilities. The Sebert et al. procedure is the only direct procedure with any
significant detection capability. All procedures identify the outliers better if the
number of outlying regressor variables is one because of the decrease in influence
exerted from the high-leverage points.

3.5  Procedure  Summary  and  Recommendations 

The most interesting performance characteristics of the various procedures have been
noted in the results for each study. This section provides a summary of those results by
procedure and discusses the powerful and vulnerable areas of performance.

3.5.1  Performance  Summary  of  Direct  Procedures 

Hadi and Simonoff. Both versions are powerful in all of the experiments in Section 3.4.1
when the regression outliers are in the interior of X-space. The most notable feature in
these scenarios is the very low false alarm probability. This prompted an increase from
$\alpha = 0.05$ to $\alpha = 0.20$ to compute the cutoff value from the t distribution
using the Bonferroni approach. Dramatic increases in detection probabilities from this
enhancement are realized in our selected scenarios, accompanied by false alarm
probabilities well below the nominal 5% rate. Detection capability moderately declines in
the high-dimension, high-density scenarios. The original 1993 version outperforms
(especially at $\alpha = 0.20$) or is equivalent to the robust 1997 version in virtually
all of the experiments in Section 3.4.1 because an initial basis of robust distances does
little when the outliers are not leverage points.

Table 3.8. Design matrix with detection and false alarm probabilities (in parentheses)
for large magnitude high leverage outliers and smaller magnitude low leverage outliers.

Overall performance noticeably degrades for these two algorithms as leverage is added as
a factor. This can be attributed to the loss of signal from the OLS studentized residuals
and scaled prediction errors. Increasing $\alpha$ to 0.20 does not increase detection
capability and may swamp too many clean observations in the high-leverage scenarios of
Section 3.4.2. Also in these scenarios, indirect methods significantly outperform both
versions of the algorithm. Detection capability is increased to reliable levels if the
outlying distance off the regression plane ($\delta_R$) is sufficiently large relative to
the leverage $\delta_L$. In the higher leverage scenarios, the robust 1997 algorithm
outperforms the original algorithm.

Swallow and Kianifard. This algorithm, based on recursive residuals from a least squares
fit using a robust scale estimate, reliably detects regression outliers in the interior of
X-space at $\delta_R = 4\sigma_\varepsilon$ and beyond. High-dimension, high-density
scenarios affect detection capability. The detection capability of this algorithm is also
highly sensitive to leverage and it lags behind the other procedures in the studies of
regression outliers unusual in X-space. Despite the lack of power in these scenarios, the
false alarm rate rarely exceeds the nominal 5% rate anywhere.

Pena and Yohai. Of the direct procedures, the Pena and Yohai algorithm with the
eigenanalysis of the influence matrix may be the most versatile. Although it does not
detect regression outliers in interior X-space until 5 to 7 $\sigma_\varepsilon$, it does
detect the high-leverage regression outliers reasonably well. Also, the procedure rarely
swamps clean observations. The scenarios presented in this paper are challenging;
however, in practice, the scenarios of interest may have magnitudes of the outlying
distances ($\delta_L$ and $\delta_R$) large enough to effectively use this procedure.

Sebert et al. The clustering algorithm of the least squares standardized predicted and
residual values is often the only procedure with any detection capability at all. For
this method to be successful, a signal has to come from one or both of these quantities.
In the scenarios of Section 3.4.1, the signal comes from the standardized residual values
only and the procedure is competitive here with the others in detection capability but
has false alarm rates often 2 to 3 times the nominal level. As the outlying distance off
the regression surface, $\delta_R$, increases, the false alarm rate decreases and the
detection capability increases. The procedure works especially well for the exterior
X-space outliers in Section 3.4.2 if the predicted response values for the outliers are
unusual with respect to the clean response values. The standardized least squares
residuals of the outliers are often close to 0 in the high-leverage scenarios so the
outlying clusters must form in the algorithm from unusual predicted values. In Section
3.4.2.2 we show that the method is vulnerable in high-leverage scenarios if the outliers
are not unusual in predicted value. If the assumption that the data will be unusual in at
least one of the two measures is met, this is a very powerful, yet easily implemented
algorithm.

3.5.2  Performance  Summary  of  Indirect  Procedures 

High breakdown estimators. Both LMS and LTS detect regression outliers in the interior of
X-space well if the outlying distance $\delta_R$ is at least $4\sigma_\varepsilon$. Both
estimators also have low false alarm rates across all scenarios when $\delta_R$ is at
least $4\sigma_\varepsilon$, unlike many of the other regression estimators evaluated in
Section 3.4.2. One notable exception is the high false alarm rates in the scenarios of
Table 3.4 in high dimension with 30% outlier density. Overall, LMS detects the outliers
in high dimension slightly better than LTS. Detection capability decreases for both
estimators as the outliers become more remote in X-space. Although not tested separately
for regression outliers remote in X-space, theory and our pilot studies showed LMS and
LTS to have significant masking and swamping problems in high-leverage cases.

The Rousseeuw and van Zomeren technique that combines the robust distances from the MVE
with the scaled residuals from an LMS fit is one of the better performing techniques for
exterior X-space regression outliers. Again, we prefer the simulated cutoff values to
protect against swamping too many observations. The weak areas are limited to
high-dimension, high-density scenarios.

M and MM estimators. These estimators are only evaluated for the regression outliers in
the interior of X-space of Section 3.4.1 because they are known to fail in the
high-leverage experiments of Section 3.4.2. The results from the first three experiments
in Section 3.4.1 indicate that the detection capability and false alarm rate of the MM
estimator are only slightly better than those of the M estimator. For these scenarios,
both estimators have excellent detection power but have moderate false alarm problems.
The high outlying magnitude, high-density runs in Section 3.4.1.4 demonstrate the
superiority of the MM estimator. Despite the comparable detection probabilities, the
M-estimator breaks down and suffers from severe false alarm rates in these scenarios.

GM and compound estimators. The standard generalized-M bounded-influence estimator is
plagued by a higher false alarm rate and lower detection capability than the compound
estimators in the exterior X-space regression outlier experiments in Section 3.4.2. This
effect is most evident in the high-density scenarios where both the OLS initial estimate
and the hat diagonal component for leverage break down for this estimator. The compound
estimators do a decent job identifying the high-leverage multiple outliers. Both the
Simpson and Montgomery and Coakley and Hettmansperger estimators have similar detection
capability and moderately high false alarm probabilities in many scenarios. In several
exterior X-space scenarios, only the compound estimators and the Rousseeuw and van
Zomeren procedure successfully detect these outliers. From Table 5a, both have little
detection capability with moderate leverage despite relatively large residual magnitudes.
This presents a research opportunity that will be explored in Chapter 4.

3.5.3  Summary  of  Results

The simulation experiments in this paper validate many of the expected performance
characteristics of the multiple outlier detection methods. As a general rule, the
detection methods perform better in lower dimension, lower outlier density, smaller
outlying leverage distance, larger outlying residual distance, and larger number of
multiple point clouds. However, we show scenarios where this is not the case for all
methods and all factors. Some factors are shown to be either not significant or to behave
opposite to the general rule. The most important findings suggest that limited studies in
low dimension of a proposed procedure are not sufficient to speculate on its performance
in higher dimension, especially if the percentage of outliers is large. From the interior
X-space studies of Section 3.4.1, the high-breakdown methods perform well. MM performs
the best overall. The 1993 version of the Hadi and Simonoff algorithm can be recommended
if the residual outlying distance is large. For the exterior X-space studies in Section
3.4.2, the compound estimators and the robust distance with high-breakdown estimator
procedures perform the best. The Simpson and Montgomery estimator and the Rousseeuw and
van Zomeren method with simulated cutoff values show the best results in our studies.


Chapter  4 

An  Improved  Robust  Regression  Compound  Estimator 


4.1  Introduction 

Barnett and Lewis (1994) define outliers as observations that appear inconsistent with
the remainder of the data set. In the linear regression model, we consider three classes
of outliers: 1) residual or regression outliers, whose response values differ
significantly from those expected from the fit with uncontaminated data, 2) leverage
outliers, whose regressor variable values are extreme in X-space and 3) observations that
are both residual and leverage outliers. A single outlier in an ordinary least squares
(OLS) regression model could be placed to alter the parameter estimates such that the fit
to the remaining n - 1 data points is poor. Fortunately, many standard least squares
regression diagnostic quantities and plots can reliably identify a single or a few of
these three types of outliers. One modeling approach in the presence of outliers is to
remove the discordant observations from the model and fit the remaining observations.
Robust regression estimators offer an alternative between removing the outliers and
including them in the model by weighting each observation as a function of
“outlyingness”.

Numerous robust regression estimators exist. It is generally accepted that no single
estimator optimally protects against all outlier scenarios likely to be encountered in
practice. The properties of a good robust regression estimator are 1) high breakdown, 2)
efficiency and 3) bounded influence. High-breakdown estimators can fit a model to the
bulk of the data even if a large percentage of outliers (as much as 50% for some
estimators) is present. Least squares has a breakdown of 0% because a single outlying
observation can be placed in a data set that makes the parameter estimates and inferences
for the remaining n - 1 observations meaningless. An efficient estimator provides
parameter estimates close to those from an OLS (the best linear unbiased estimator) fit
in an uncontaminated sample with NID error terms. Bounded-influence estimators protect
the regression surface from being pulled toward extreme observations in X-space. OLS
estimators do not have bounded influence and the more extreme the outlier is in X-space,
the greater the impact it has on the parameter estimates.
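
The 0% breakdown of least squares described above can be illustrated with a minimal
sketch. The data, the helper name `ols_line`, and the placement of the planted
observation are illustrative assumptions and are not taken from the study.

```python
# Illustrative sketch: one planted observation can move an OLS fit without bound.
# The data and the helper name ols_line are hypothetical.

def ols_line(x, y):
    """Return (intercept, slope) of the simple least squares line."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Clean data lying exactly on y = 1 + 2x.
x_clean = [float(i) for i in range(10)]
y_clean = [1.0 + 2.0 * xi for xi in x_clean]
clean_fit = ols_line(x_clean, y_clean)

# Add a single observation that is remote in X-space and far off the line.
# Pushing its response farther away keeps dragging the fitted slope with it.
fits = {}
for y_out in (-100.0, -1000.0, -10000.0):
    fits[y_out] = ols_line(x_clean + [50.0], y_clean + [y_out])

print("clean fit (intercept, slope):", clean_fit)
for y_out, (b0, b1) in fits.items():
    print(f"outlier y = {y_out:>9}: intercept = {b0:10.2f}, slope = {b1:10.2f}")
```

Because the planted point has high leverage, the fitted slope follows it as its response
moves away, so the line no longer describes the ten clean points.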

Theoretical and simulation results in the literature show that many robust regression
estimators are vulnerable with respect to at least one of the three desirable properties.
For example, the common high-breakdown estimators suffer from inefficiency and unbounded
influence while many efficient techniques are neither high-breakdown nor
bounded-influence. Multi-staged techniques have been proposed to combine several of the
properties into a single estimator. There exist multi-staged compound and generalized
M-estimators (GM) with all three properties that can accommodate data sets with all three
classes of outliers.

Compound and GM estimators downweight outlying observations by minimizing a function of
the residuals rather than the sum of the squared residuals (OLS). Parameter estimates are
obtained by solving a system of nonlinear normal equations. The normal equations
incorporate a leverage measure to accommodate high-leverage points and a robust measure
of scale. An iteration scheme to solve the normal equations requires good initial
parameter estimates; these are often from a high-breakdown estimator. A compound
estimator uses only a single iteration to solve the nonlinear normal equations to
preserve the high-breakdown property (Simpson, Ruppert and Carroll, 1992, and Yohai,
1997). GM estimators use a fully iterated scheme and have a breakdown of 1/p.

There  have  been  some  empirical  performance  studies  of  robust  regression 
estimators  (Simpson  and  Montgomery,  1998a,  1998b,  Wilcox,  1997,  and  Meintanis  and 
Donatas,  1997).  The  best  performing  estimators  with  respect  to  breakdown,  bounded- 
influence,  efficiency  and  robustness  to  outlier  scenarios  appear  to  be  the  compound 
estimators  of  Coakley  and  Hettmansperger  (1993)  (C&H)  and  Simpson  and  Montgomery 
(1998a)  (S&M).  This  chapter  proposes  several  compound  estimators  with  alternative 
high-breakdown  initial  estimators  and  measures  of  leverage  and  recommends  a  single 
method. 

Section 4.2 explains GM and compound estimators. An example in Section 4.3 exposes some
vulnerabilities in the measures of leverage and initial estimates for published compound
estimators. Section 4.4 is an extensive Monte Carlo performance study of some measures of
leverage. Section 4.5 incorporates the best performing measure of leverage from Section
4.4 and develops the need for a better initial estimator. Section 4.6 tests several
common and proposed initial high-breakdown estimators. We propose a new compound
estimator in Section 4.7, conduct performance studies in Section 4.8, and summarize
results in Section 4.9.

4.2  Compound  Estimators  in  Linear  Regression

The standard linear regression model is $y = X\beta + \varepsilon$ where y is the
observed response vector of dimension n, the number of observations; X is the observed
n × p matrix of regressor variables with intercept; $\beta$ is the vector of regression
parameters, and $\varepsilon$ is the column vector of n random errors assumed to have
mean 0 and covariance matrix $\sigma^2 I$. GM-estimators were offered as improvements to
their predecessor M-estimates (maximum likelihood) to protect against high-leverage
outliers. Rather than minimize the sum of squared errors as the objective, the M-estimate
minimizes a function $\rho$ of the errors. The M-estimate objective is

$$\min_{\beta} \sum_{i=1}^{n} \rho\!\left(\frac{e_i}{s}\right) = \min_{\beta} \sum_{i=1}^{n} \rho\!\left(\frac{y_i - x_i'\beta}{s}\right)$$

where s is an estimate of scale often formed from a linear combination of the residuals.
The system of normal equations to solve this minimization problem is found by taking
partial derivatives with respect to $\beta$ and setting them equal to 0, yielding

$$\sum_{i=1}^{n} \psi\!\left(\frac{y_i - x_i'\beta}{s}\right) x_i = 0$$

where $\psi$ is the derivative of $\rho$.
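
One standard way to read these normal equations, offered here as a sketch rather than as
the derivation used in this study, is to write $\psi(u) = w(u)\,u$ with weight
$w(u) = \psi(u)/u$. The symbols $u_i$, $w_i$ and $W$ are notation introduced only for
this illustration, and $x_i'$ denotes the i-th row of X.

```latex
u_i = \frac{y_i - x_i'\beta}{s}, \qquad w_i = \frac{\psi(u_i)}{u_i}
\;\;\Longrightarrow\;\;
\sum_{i=1}^{n} w_i \,(y_i - x_i'\beta)\, x_i = 0
\;\;\Longleftrightarrow\;\;
X'WX\,\beta = X'Wy, \qquad W = \operatorname{diag}(w_1,\dots,w_n).
```

For fixed weights this is a weighted least squares problem. Because the weights depend
on $\beta$ through the residuals, the system is nonlinear and is solved iteratively.
Least squares is the special case $\psi(u) = u$, for which every $w_i = 1$.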

The choice of the $\psi$-function is based on the preference of how much weight to assign
outliers (see e.g. Montgomery and Peck, 1992). A monotone $\psi$-function does not weight
large outliers as much as least squares (e.g. a $10\sigma$ outlier would receive the same
weight as a $3\sigma$ outlier). A redescending $\psi$-function increases the weight
assigned to an outlier until a specified distance (e.g. $3\sigma$) and then decreases the
weight to 0 as the outlying distance gets larger.

... the GM objective function is Schweppe (Handschin et al., 1975). If the $\pi$-weights
are not inside the argument, then the GM objective function is Mallows (Mallows, 1975).
In practice, the distinction between the two objective functions is that Mallows will
downweight high-leverage points independently of the residual value while Schweppe will
not downweight if the response value conforms to the regression surface. Thus, Mallows
does not incorporate “good outliers” in the parameter estimates. Several approaches to
forming the $\pi$-weights use a distance measure from either the hat diagonal
($h_{ii} = x_i'(X'X)^{-1}x_i$), M-estimates of covariance, the minimum volume ellipsoid
(MVE) or the minimum covariance determinant (MCD). These methods and some proposed
methods are described in Section 4.4.
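
The behavior of monotone and redescending $\psi$-functions, and the practical difference
between the Mallows and Schweppe forms, can be sketched as follows. The sketch assumes
the commonly cited forms in which Mallows uses $\pi_i\,\psi(r_i/s)$ and Schweppe uses
$\pi_i\,\psi(r_i/(\pi_i s))$. The function names, the tuning constants 1.345 and 4.685,
and the example values are assumptions made for illustration.

```python
# Illustrative sketch of psi-functions and GM weighting; all names are hypothetical.
# The tuning constants are commonly used defaults, assumed here rather than taken
# from the study.

def psi_huber(u, c=1.345):
    """Monotone psi: linear near 0, constant beyond +/- c."""
    return max(-c, min(c, u))

def psi_tukey(u, c=4.685):
    """Redescending psi (Tukey biweight): returns to 0 beyond +/- c."""
    if abs(u) >= c:
        return 0.0
    return u * (1.0 - (u / c) ** 2) ** 2

def irls_weight(psi, u):
    """Weight psi(u)/u, with the limiting value 1 at u = 0."""
    return 1.0 if u == 0.0 else psi(u) / u

def mallows_weight(psi, r, s, pi):
    """Mallows form: the pi-weight multiplies psi, outside its argument."""
    return pi * irls_weight(psi, r / s)

def schweppe_weight(psi, r, s, pi):
    """Schweppe form: the pi-weight also rescales the residual inside psi."""
    return irls_weight(psi, r / (pi * s))

# A monotone psi gives a 3-sigma and a 10-sigma residual the same influence,
# while a redescending psi drives the influence of the 10-sigma residual to 0.
print("Huber psi at 3, 10:", psi_huber(3.0), psi_huber(10.0))
print("Tukey psi at 3, 10:", psi_tukey(3.0), psi_tukey(10.0))

# A "good" high-leverage point: pi-weight 0.2, residual exactly 0.
print("Mallows weight :", mallows_weight(psi_huber, 0.0, 1.0, 0.2))
print("Schweppe weight:", schweppe_weight(psi_huber, 0.0, 1.0, 0.2))
```

In this sketch the conforming high-leverage point keeps full weight under Schweppe but is
downweighted to its $\pi$-weight under Mallows, which matches the distinction drawn in
the text.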

A numerical optimization scheme is required to solve the GM system of nonlinear normal
equations. The two most common approaches are Newton’s method and iteratively reweighted
least squares (IRLS). Both approaches require initial parameter estimates for $\beta$.
Most initial estimators are selected to provide decent parameter estimates in the
presence of a large percentage (as much as 50% in some cases) of outliers. The popular
choices for these high-breakdown initial estimates are the least median of squares
(Rousseeuw, 1984) (LMS), least trimmed sum of squares (Rousseeuw, 1985) (LTS), and
S-estimation (Rousseeuw and Yohai, 1984). The final parameter estimates from the
optimization routine can come from a fully iterated solution (GM-estimator) or only a
single iteration (compound estimator). The single iteration method preserves the
breakdown of the initial estimator.
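
The idea of a high-breakdown start followed by a single reweighting step can be sketched
as follows. This is a minimal illustration for simple regression and is not the S&M or
C&H estimator: it approximates LMS by random two-point fits, uses Tukey biweight weights,
and omits the $\pi$-weights for leverage. All function names, constants, and data are
assumptions made for the example.

```python
import random
import statistics

def wls_line(x, y, w):
    """Weighted least squares line; returns (intercept, slope)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def lms_line(x, y, n_subsets=500, seed=0):
    """Approximate least median of squares using random two-point fits."""
    rng = random.Random(seed)
    best, best_crit = None, float("inf")
    for _ in range(n_subsets):
        i, j = rng.sample(range(len(x)), 2)
        if x[i] == x[j]:
            continue  # vertical line, skip
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        crit = statistics.median(
            (yk - intercept - slope * xk) ** 2 for xk, yk in zip(x, y)
        )
        if crit < best_crit:
            best, best_crit = (intercept, slope), crit
    return best, best_crit

def tukey_weight(u, c=4.685):
    """Tukey biweight weight psi(u)/u; exactly 0 beyond +/- c."""
    return (1.0 - (u / c) ** 2) ** 2 if abs(u) < c else 0.0

# Hypothetical data: 40 clean points near y = 1 + 2x and 10 planted
# high-leverage outliers (20% contamination).
rng = random.Random(1)
x = [rng.uniform(0.0, 10.0) for _ in range(40)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]
x += [20.0 + rng.uniform(0.0, 1.0) for _ in range(10)]
y += [0.0] * 10

ols = wls_line(x, y, [1.0] * len(x))          # ordinary least squares
lms, crit = lms_line(x, y)                    # high-breakdown initial estimate
scale = 1.4826 * crit ** 0.5                  # rough robust scale from the LMS fit
weights = [tukey_weight((yi - lms[0] - lms[1] * xi) / scale) for xi, yi in zip(x, y)]
one_step = wls_line(x, y, weights)            # single reweighted least squares step

print("OLS      (intercept, slope):", ols)
print("LMS      (intercept, slope):", lms)
print("one-step (intercept, slope):", one_step)
```

The single weighted fit starts from the LMS residuals, so the planted points receive
weights at or near 0 and the resulting line is driven by the clean observations.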

Simpson and Montgomery (1998b) test several GM and compound estimators and find that the
Simpson and Montgomery (1998a) estimator and the Coakley and Hettmansperger (1993)
estimator have good overall performance. The S&M estimator uses an S-estimate that
minimizes the dispersion of the residuals for both the initial parameter estimates and
the measure of scale. Other components are modified M-estimates of covariance distances
to form the $\pi$-weights, a Schweppe GM objective function, a redescending Tukey
$\psi$-function and a one-step reweighted least squares convergence criterion. The C&H
estimator uses an LTS initial estimate, an LMS estimate of scale (the initial estimate’s
scaled median residual), robust distances from an MVE estimator for the $\pi$-weight
component, a monotone Huber $\psi$-function and solves the normal equations with a single
iteration of a Newton algorithm.


4.3  Compound  Estimator  Example 

Consider creating a regression data set of n = 60 observations and k = 6 regressor
variables with 12 high-leverage residual outliers. The outliers are remote in X-space
because the values for their first two of six regressors are 5 standard deviations above
the mean of the clean regressor variable values. The response values for these outliers
are 10 standard deviations away from the regression surface defined by the fit from the
clean 48 cases. Table 4.1 describes how the regressor and response variables are
generated for both the clean and outlying cases. The last 12 cases in the data set (shown
in Appendix B) are the planted outliers.


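Table 4.1. [Caption and table body only partly legible in the source. Legible column
headings: Case; $x_{ij}$, j = 1, 2; $x_{ij}$, j = 3, 4, 5, 6; $y_i$. Legible row labels:
1-48 and 49-55. Legible entries: 27.5 + UNIF(0,1), 29.5 + UNIF(0,1), and 0.]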

Because there is a large percentage of high-leverage points, a GM or compound estimator
is likely to be our best choice to accommodate the outliers. We choose the S&M and C&H
estimators. Both estimators erroneously fit the 12 outliers (residuals near 0) and assign
weights of nearly 100% to these observations. By chasing the outliers, the fit for the 48
clean cases is degraded. Many clean cases (8 for S&M and 7 for C&H) now have large
residuals that a researcher could erroneously label as outliers. The mean squared error
(MSe) for the 48 clean cases using the S&M and C&H parameter estimates is more than three
times the MSe obtained by a least squares fit to the clean data. Another problem is that
the $\pi$-weights for these high-leverage observations are not unusual. If the
contamination is reduced in this example to 10% from 20% or if the leverage distance is
reduced to 2 standard deviations above the mean from 5, for example, the outlying
observations are correctly downweighted for both S&M and C&H.

From the performance studies in Chapter 3, the S&M and C&H estimators are successful
across a variety of outlier scenarios, but are vulnerable (as are all techniques) in the
high-leverage, high-density, high-dimension scenarios of this example. A possible
solution to the problem is to find better estimates of leverage because the
$\pi$-weights are not providing any indication of unusual geometry in X-space.

4.4  A  Performance  Study  for  Measures  of  Leverage 

This section describes several measures of leverage from the literature that can be used
to form $\pi$-weights. Monte Carlo simulations using factorial designs with factors
thought to impact performance provide a comprehensive test of each procedure across
numerous X-space conditions. The goal is to possibly improve an existing GM or compound
estimator by finding a technique or a combination of techniques that performs well in
most scenarios likely to be encountered in practice.

The standard measure of leverage in OLS is the hat diagonal element. This quantity is
often used as a measure of “outlyingness” in X-space and is extensively used in influence
diagnostic quantities. Remote observations in X-space may exert enough influence on the
least squares estimates to make them quite different from those obtained with only the
observations in the interior of X-space. Some GM estimators (e.g. Walker, 1984)
incorporate the hat diagonal measures of remoteness in X-space to accommodate these
outlying observations. The hat diagonal measure may not provide an adequate leverage
measure when there is even a moderate number of outliers in X-space present because the
covariance matrix estimate is significantly influenced or “pulled” toward the outliers.
For the data set of Example 4.1 using only the first 49 observations, the outlier
(observation 49) has a hat diagonal value of 0.60 which exceeds the usual cutoff (3p/n)
of 0.42. However, for the full data set, the hat diagonals are not at all unusual for the
12 outliers because the outliers have significantly altered the covariance matrix.
Therefore, the hat diagonal has broken down in the presence of multiple outliers and does
not provide a reliable measure of leverage.
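
The masking of the hat diagonal by a cluster of leverage points can be reproduced in
spirit with a small sketch. The data below are simulated under assumptions of our own
(standard normal clean regressors, planted cases shifted about 5 standard deviations in
the first two regressors) and are not the data set of Appendix B; `hat_diagonals` is a
hypothetical helper name.

```python
import numpy as np

def hat_diagonals(X):
    """Diagonal of H = Z (Z'Z)^{-1} Z', where Z is X with an intercept column."""
    Z = np.column_stack([np.ones(len(X)), X])
    Q, _ = np.linalg.qr(Z)            # H = QQ' for a full-rank Z
    return np.sum(Q ** 2, axis=1)

rng = np.random.default_rng(0)
n_clean, n_out, k = 48, 12, 6
p = k + 1                             # regressors plus intercept

X_clean = rng.normal(size=(n_clean, k))
X_out = rng.normal(size=(n_out, k))
# Planted cases: remote in the first two regressors only (assumed geometry).
X_out[:, :2] = 5.0 + rng.uniform(0.0, 0.5, size=(n_out, 2))
X = np.vstack([X_clean, X_out])

h_single = hat_diagonals(X[:49])      # 48 clean cases plus one planted case
h_full = hat_diagonals(X)             # all 12 planted cases present
cut_single = 3 * p / 49
cut_full = 3 * p / 60

print(f"one outlier : h = {h_single[48]:.2f}, cutoff = {cut_single:.2f}")
print(f"12 outliers : mean h = {h_full[48:].mean():.2f}, cutoff = {cut_full:.2f}")
print("planted cases flagged in full data:", int((h_full[48:] > cut_full).sum()))
```

With one planted case the hat diagonal stands out, but once the whole cluster is present
the planted cases pull the cross-product matrix toward themselves and their hat diagonals
fall back toward ordinary values.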

High-breakdown measures of leverage have been proposed that use a robust measure of the
mean and covariance matrix in the standard Mahalanobis distance computation,
$MD_i^2 = (x_i - \bar{x})' S^{-1} (x_i - \bar{x})$, where $x_i$ is the k × 1 vector of
observations, $\bar{x}$ is the mean vector of X and S is the k × k sample covariance
matrix. The robust estimates of the mean and covariance matrix are the classical mean and
covariance estimates computed using a subset of the data assumed to be outlier free. The
leverage methods we test are robust distances from the MVE, MCD, the M-estimates of
covariance and the Rocke and Woodruff (1996) (R&W) hybrid estimator. We also investigate
the Hadi (1992, 1994) forward search algorithm and the Sebert, Montgomery and Rollier
(1998) (SM&R) clustering algorithm that can detect multiple outliers. We also consider
the usual hat diagonal measure that is equivalent to the Mahalanobis distance apart from
a few constants.
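
The equivalence between the hat diagonal and the classical Mahalanobis distance can be
checked numerically. The sketch below uses the standard identity
$h_{ii} = 1/n + MD_i^2/(n-1)$ for a model with an intercept, with S computed using
divisor n - 1. The simulated data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 3
X = rng.normal(size=(n, k))           # hypothetical regressor matrix, no intercept

# Classical squared Mahalanobis distances.
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)           # sample covariance, divisor n - 1
diff = X - xbar
md2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S), diff)

# Hat diagonals from the regression design matrix with an intercept column.
Z = np.column_stack([np.ones(n), X])
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
h = np.diag(H)

# The two leverage measures differ only by the constants 1/n and 1/(n - 1).
h_from_md = 1.0 / n + md2 / (n - 1)
print("max abs difference:", float(np.max(np.abs(h - h_from_md))))
```

Because the two quantities are affine functions of one another, any masking that affects
the classical mean and covariance affects the hat diagonal in the same way.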

4.4.1  Method  Description 

The Hadi (1992, 1994) forward search algorithm on robust distances. This algorithm forms
the initial basis of p + 1 clean observations from the minimum robust distances. The
robust distance measure is the Mahalanobis distance computed with the median vector and
covariance matrix based on the median rather than the mean. The initial basis is
sequentially increased to size h = (n + p + 1)/2 by adding the observation with the least
robust distance calculated from the mean vector and covariance matrix of the current
basis. Next, the basis is sequentially increased by the case with the lowest robust
distance using the mean vector and a corrected covariance matrix from the current basis.
If the lowest robust distance exceeds $\chi^2_{p,\alpha/n}$, then all observations not in
the current basis are declared outliers. We use the author’s S-Plus code.

M-estimates  of  covariance.  Hampel  (1973)  first  suggested  M-estimates  of 
covariance,  but  the  basic  paper  on  these  estimators  is  attributed  to  Maronna  (1976). 
Maronna  addressed  the  problems  of  existence,  uniqueness,  asymptotic  distribution  and 
breakdown  point  for  these  estimators.  We  are  interested  in  the  distances  in  X-space  for 
each  observation  defined  by  2  =  A(x  —  t)  where  A  is  an  estimate  of  the  pxp  multivariate 

scatter  matrix  and  t  the  multivariate  location  vector.  Note  that  (A'A)“  is  the  estimate  of 
the  covariance  inatrix  of  X.  From  Huber  (1981),  the  maximum  likelihood  estimate  of  A 
and  t  is  determined  by  solving  the  simultaneous  equations 

ave{w(l2l)z}  =  0 
ave({«lz|)z2‘^  -  v(|2|)Ip}  =  0 

where  u,  v  and  w  are  arbitrary  weight  functions  and  ave{*}  is  the  average  taken  over  the 
sample.  We  solve  these  equations  using  the  Newton  algorithm  and  Huber  weight 
functions  with  the  associated  constants  and  correction  factors  as  defined  in  the  ROBETH 
library  accessed  by  S-Plus  (Marazzi,  1993).  An  observation  is  declared  an  outlier  in  X- 
space  if  the  distance  z  exceeds  the  95*’’  percentile  of  simulated  (1000  replicates)  distances 
under  the  null  hypothesis  of  no  outliers  for  a  specified  n  and  p.  Chapter  2  contains  a 
detailed discussion of the algorithm.

MVE and MCD estimators. The MVE estimate of the mean is the center of the
smallest ellipsoid covering at least half of the observations. The estimate of the
covariance matrix is determined from these cases along with a correction factor for
consistency at multivariate normal distributions. The MCD is the set of just over half of
the observations with the minimum covariance matrix determinant. Cutoff values for
robust distances from simulation (1000 replicates) determine whether an observation is
classified as an outlier or not. There are numerous algorithms to find the MVE and MCD
that provide widely varying results for the same data set; we use the recently modified
genetic algorithms internal to S-Plus 4.5 (Burns, 1992).
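
The simulated cutoff idea used for both the M-estimate and MVE/MCD distances can be sketched as follows. This is only an illustration under stated assumptions: the distance function here is a simple coordinate-wise median/MAD stand-in (not the MVE, MCD or ROBETH computation), the null data are standard normal, and the percentile is taken over the distances pooled across replicates. The names `standin_distances` and `simulated_cutoff` are hypothetical.

```python
import math
import random
import statistics


def standin_distances(rows):
    """Stand-in robust distance (an assumption, not MVE/MCD):
    Euclidean norm of (x - median) / (1.4826 * MAD), column by column."""
    cols = list(zip(*rows))
    med = [statistics.median(c) for c in cols]
    scale = [1.4826 * statistics.median([abs(v - m) for v in c])
             for c, m in zip(cols, med)]
    return [math.sqrt(sum(((v - m) / s) ** 2
                          for v, m, s in zip(row, med, scale)))
            for row in rows]


def simulated_cutoff(n, p, reps=1000, q=0.95, seed=0):
    """q-th percentile of distances pooled over `reps` clean data sets
    of n observations in p dimensions (null hypothesis of no outliers)."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(reps):
        rows = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
        pooled.extend(standin_distances(rows))
    pooled.sort()
    return pooled[math.ceil(q * len(pooled)) - 1]


# Fewer replicates than the 1000 used in the study, to keep the sketch fast.
cutoff = simulated_cutoff(n=40, p=2, reps=200)
```

An observation whose distance exceeds `cutoff` would be declared an X-space outlier; a different distance function would need its own simulated cutoff for each n and p.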

The  R&W  (1996)  hybrid  procedure.  Rocke  and  Woodruff  combine  several  results 
in  the  literature  in  their  complex  two-phase  algorithm  to  detect  multiple  outliers.  The 
output  of  the  first  phase  is  an  estimate  of  multivariate  location  and  shape.  This  robust 
estimate  is  determined  by  first  partitioning  the  data  equally  into  cells  to  minimize  the 
impact  on  computational  complexity.  Within  each  cell,  the  observations  from  the  MCD 
using Hawkins' (1993) steepest descent algorithm with random restarts provide the
starting point for a sequential point addition algorithm from Hadi (1992). This result is
then used as a starting point for the translated biweight M-estimation of the mean and
covariance matrices.

The  second  phase  runs  a  simulation  to  determine  the  appropriate  cutoff  value  to 
classify  observations  as  outliers  based  on  n  observations  in  p  dimensions  using  clean 
multivariate  normal  data  in  the  Phase  I  algorithm.  To  increase  efficiency,  new  location 
and shape matrices are formed from the set of observations below the simulated cutoff
value. The robust distance is calculated using these new location and shape matrices and
compared to a critical value to classify the observation as outlying or not. The
authors provided their compiled C++ code.

The  SM&R  (1998)  clustering  algorithm.  This  approach  uses  a  single-linkage 
clustering  algorithm  with  Euclidean  distances  on  the  standardized  predicted  and 
standardized  residual  values  from  a  least  squares  fit.  The  algorithm  finds  the  single 
largest cluster, or the bulk of the data, and classifies it as the clean observations.
Mojena’s stopping rule forms the final clusters by splitting a cluster tree at the average of
the n - 1 tree cluster heights (a measure of cluster separation) plus 1.25 times the standard
deviation of the tree cluster heights.
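
As a small illustration of the stopping rule just described, the cut height can be computed directly from the n - 1 merge heights of the cluster tree. The heights below are made up for the example, the function name `mojena_cutoff` is hypothetical, and the use of the sample (n - 1 divisor) standard deviation is an assumption.

```python
import statistics


def mojena_cutoff(heights, multiplier=1.25):
    """Mean of the tree cluster heights plus `multiplier` sample standard deviations."""
    return statistics.fmean(heights) + multiplier * statistics.stdev(heights)


# Hypothetical merge heights from a single-linkage tree on six observations.
heights = [1.0, 2.0, 3.0, 4.0, 10.0]
cut = mojena_cutoff(heights)

# Merges above the cut are not performed, so the tree splits there.
splits = [h for h in heights if h > cut]
```

Here only the last merge (height 10) exceeds the cut, so the tree separates into two clusters and the larger one would be treated as the clean data.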

4.4.2  Monte  Carlo  Simulation  Leverage  Study 

We  conduct  a  performance  study  that  tests  the  ability  of  the  previously  described 
methods  to  identify  high-leverage  observations  across  a  variety  of  scenarios.  The  key  to 
a  good  leverage  measure  is  to  develop  an  estimate  of  the  mean  vector  and  covariance 
matrix  that  is  not  influenced  by  outliers.  This  suggests  that  we  do  not  want  outlying 
observations  included  in  the  calculations  of  the  parameter  estimates  and  also  that  we  want 
as many clean observations included as possible. An observation is masked if it is truly
an outlier but the procedure does not detect it, and an observation is swamped if the
procedure identifies it as an outlier when it is a clean observation. The primary measures
of performance are: 1) the probability that an outlying observation is detected and 2) the
probability that a known clean observation is identified as an outlier. Note that the
masking probability is the complement of the first measure of performance and the
swamping probability is the second measure of performance.
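
For a single replicate, the two performance measures can be computed from the known outlier labels and the procedure's flags, as in the sketch below. The function name and the example labels are hypothetical; in the study these proportions would be averaged over the replicates of each scenario.

```python
def detection_and_false_alarm(is_outlier, flagged):
    """Return (detection proportion, false alarm proportion) for one replicate."""
    n_out = sum(is_outlier)
    n_clean = len(is_outlier) - n_out
    detected = sum(1 for o, f in zip(is_outlier, flagged) if o and f)
    swamped = sum(1 for o, f in zip(is_outlier, flagged) if (not o) and f)
    p_detect = detected / n_out if n_out else float("nan")
    p_false = swamped / n_clean if n_clean else float("nan")
    return p_detect, p_false


# Hypothetical replicate: 4 planted outliers followed by 16 clean observations.
truth = [True] * 4 + [False] * 16
flags = [True, True, True, False] + [True] + [False] * 15

p_detect, p_false = detection_and_false_alarm(truth, flags)
p_masking = 1.0 - p_detect   # complement of the detection measure
p_swamping = p_false         # identical to the false alarm measure
```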

The simulation scenarios place a cluster or two clusters of several observations at
a specified location shifted in X-space because this geometry challenges the procedures
(Rocke and Woodruff, 1996). The factors investigated in these studies are dimension,
density (percentage of outlying observations), the number of standard deviations from the
mean that the cloud is placed in X-space (δ_L), and the number of multiple point clouds.
Additionally, we consider the number of regressors out of k that are unusual for the
outlying observations as a factor because many studies only consider all k variables in
their tests. Only the SM&R procedure requires response values. The response values
conform to the regression surface so that we do not give the SM&R procedure an unfair
advantage. There are 500 replicates for each scenario and all simulations are performed
in S-Plus 4.5.

The  simulation  results  are  reported  in  tables  that  provide  the  probability  of 
detection  and  the  probability  of  false  alarm  (in  parentheses)  in  each  cell.  We  also  report 
the statistically significant effects from the analysis of variance for each procedure. The
significant main effects and two-factor interactions provide guidance in the table of where
to look for significant differences in performance. Note that the significant effects are
valid for the region of operability defined by the factor settings and a different set of
effects could occur if the factor settings are changed.

4.4.2.1 Outlying Observations Unusual in All Regressor Variables

This factorial experiment tests the procedures’ detection ability and resistance to
swamping when the outliers are placed in a single cloud located at a distance δ_L standard
deviations from the mean for all k regressor variables. The generating distribution for the
clean regressor variables is NID(μ_x, σ_x^2) with μ_x = 7.5 and σ_x = 4. The i-th observation
with outlying magnitude δ_L standard deviations is placed at

x_ij = x̄_clean,j + 4δ_L + ε_ij,  j = 1, ..., k,

where x̄_clean,j is the mean of the known clean observations for the j-th regressor variable.
The random component ε_ij, distributed Uniform(0, 0.25), separates the observations within
the outlying cloud to avoid the possibility of singular X-matrices in some procedures. An
observation in a second outlying cloud (if applicable) is placed at

x_ij = x̄_clean,j - 4δ_L + ε_ij.

The responses for all observations are generated as y_i = β'x_i + ε_i, where β is the vector
of known regression coefficients selected for the simulations to be 0 for the intercept and
5 for each of the k regressor variables and ε_i is a standard normal variate.
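
A minimal sketch of how one such data set could be generated for a single outlying cloud is given below. It follows the description above (clean regressors from a normal distribution with mean 7.5 and standard deviation 4, outliers shifted by 4δ_L plus a Uniform(0, 0.25) jitter, coefficients of 5 and standard normal errors); the function name, the default n, k and density, and the seed are illustrative assumptions rather than the settings of any particular run.

```python
import random
import statistics


def generate_scenario(n=40, k=2, density=0.10, delta_l=4.0, seed=1):
    """Return (X, y, is_outlier, clean_means) for one cloud shifted by delta_l sigma."""
    rng = random.Random(seed)
    mu_x, sigma_x, beta = 7.5, 4.0, 5.0
    n_out = round(density * n)
    n_clean = n - n_out

    # Clean regressors: independent normal draws.
    clean = [[rng.gauss(mu_x, sigma_x) for _ in range(k)] for _ in range(n_clean)]
    clean_means = [statistics.fmean(col) for col in zip(*clean)]

    # Outlying cloud: clean mean + sigma_x * delta_l + small uniform jitter.
    cloud = [[clean_means[j] + sigma_x * delta_l + rng.uniform(0.0, 0.25)
              for j in range(k)] for _ in range(n_out)]

    X = clean + cloud
    # Responses conform to the regression surface (intercept 0, slopes 5).
    y = [beta * sum(row) + rng.gauss(0.0, 1.0) for row in X]
    is_outlier = [False] * n_clean + [True] * n_out
    return X, y, is_outlier, clean_means


X, y, is_outlier, clean_means = generate_scenario()
```

A second cloud would be produced the same way with the shift subtracted instead of added.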

The 2^4 factorial design in Table 4.2 contains in each cell the probability of
detection and, in parentheses, the probability of false alarm. For completeness, there are
four additional scenarios added to test the detection capability at higher levels of leverage
(δ_L) in high-density and high-dimension scenarios because none of the procedures
reliably detects the outliers at the original factor settings. The significant main effects
and two-factor interactions and the average detection and false alarm probabilities that
are located in the last three rows of Table 4.2 provide summary information to assess
overall performance.

There are three distinct categories of performance based on detection capability in
these scenarios: 1) perfect detection capability from the SM&R clustering procedure, 2)
generally very good power from the MVE, MCD, R&W and the Hadi forward selection
algorithm and 3) poor detection capability from the Mahalanobis distance/hat diagonal
and M-estimates of covariance. The reason for the SM&R success is explained by the
single-linkage clustering algorithm on the predicted and residual values. For the
high-leverage observations in these runs, the OLS residuals are essentially zero and the
predicted values are quite unusual with respect to the clean observations (e.g. for k = 6,
δ_L = 4, E(y) = 225 for the clean observations and E(y) = 705 for the outliers).

The  combinatorial  procedures  (MVE,  MCD,  and  R&W)  perform  well  except  in 
the  high-density,  high-dimension  runs  as  indicated  by  the  shading  in  Table  4.2.  The  four 
supplemental  runs  indicate  that  the  R&W  estimator  is  superior  to  the  MVE  or  MCD  in 
the  high-dimension,  high-density  scenarios;  particularly  when  false  alarm  probability  is 
considered.  We  also  note  from  Table  4.2  that  all  main  effects,  except  the  number  of 
clouds,  and  most  two-factor  interactions  with  these  three  active  effects,  are  significant  for 
these  combinatorial  estimators. 

The  Hadi  forward  selection  algorithm  has  less  ability  to  correctly  identify  outliers 
than  the  combinatorial  procedures.  Pilot  studies  show  power  could  be  increased 
substantially  (except  in  the  high-dimension,  high-density  scenarios)  if  the  cutoff  value 
were lowered because there is a significant gap in robust distances between the clean and
outlying observations. The distances, although unusual, do not cross the threshold to
declare the observations outliers. We also note the virtual nonexistence of false alarms
with the Hadi method. The M-estimates of covariance and the hat diagonals
(Mahalanobis distance) only have power in low-dimension, low-density and
high-magnitude scenarios. Interestingly, these procedures still do not detect outliers in
high-dimension and/or high-density scenarios even if the outlying magnitude δ_L is as high
as 50σ_x. Although these two procedures are not useful at detection, they generally will
not swamp clean observations.

Table 4.2. Design matrix with detection and false alarm probabilities (in
parentheses) for high-leverage observations in multiple point clouds that have unusual
4.4.2.2  Outlying  Observations  that  are  Unusual  in  Only  One  of  k  Variables 

In many data sets, the high-leverage outliers may be unusual in only a single
variable rather than the entire variable set as is often investigated in published data sets
and in Section 4.4.2.1. This experiment is similar to that of Section 4.4.2.1 except that the
last k - 1 regressor variable values are generated from NID(7.5, 4^2) for both the clean and
outlying observations. Essentially, these are randomly scattered outliers with the cloud(s)
formed only in a single regressor variable. Our experiments have shown that for all
methods there is a dramatic decrease in power and an increase in false alarm rate if the
remaining k - 1 variables are placed approximately at the mean of 7.5 rather than allowed
to randomly vary as NID(7.5, 4^2) for the outlying observations.

Pilot studies indicate that none of the procedures have any detection capability
until δ_L = 4σ_x; therefore, the low level for leverage magnitude is increased for this study
from 3σ_x to 4σ_x. The design matrix and results in Table 4.3 are supplemented with two
additional runs at a higher magnitude δ_L for the high-dimension, high-density runs.

The results for this section are generally consistent with those from the previous
experiment in Section 4.4.2.1; however, the SM&R clustering algorithm performance has
a significant decrease in detection capability and increase in false alarm rate. This can be
attributed to the outliers’ predicted response values not being as unusual as those of
Section 4.4.2.1 because only one, rather than all, regressors is abnormally large. The
SM&R  method's  detection  capability  is  competitive  with  the  others  in  low  dimension,  but 
has  little  power  in  high-dimension  and  suffers  from  high  false  alarm  rates  in  all  scenarios. 
The MVE, MCD and R&W procedures have lost significant detection capability (30% to
50%) from the previous study in similar scenarios at the δ_L = 4σ factor settings for
outlying  magnitude.  The  R&W  estimator  is  either  at  or  near  the  top  in  detection 
capability  for  these  scenarios.  In  contrast  to  the  findings  in  Section  4.4.2.1,  we  note  that 
the  MCD  and  MVE  estimators  perform  reasonably  well  in  the  high-dimension,  high- 
density  runs,  particularly  for  false  alarm  probabilities.  The  combinatorial  estimators 
outperform  the  Hadi  procedure.  Significant  gaps  still  exist  for  the  Hadi  procedure 
between  the  outlier  and  inlier  robust  distances  and  also  there  are  no  false  alarms.  The  M- 
estimates  of  covariance  and  the  Mahalanobis  distance  again  are  poor  performers.  The 
fact that M-estimates of covariance have more power if only a single regressor variable is
outlying rather than all k is somewhat counterintuitive.

Table 4.3. Design matrix with detection and false alarm probabilities (in
parentheses) for outlying multiple point clouds in only one of the k variables.

4.4.2.3 High-density, High-Magnitude Outliers

The  results  from  the  previous  two  studies  indicate  that  the  procedures  have 
difficulty  correctly  identifying  the  outliers  in  the  high-dimension,  high-density  scenarios. 
This study changes the levels for the total outlier density factor from 10% for the low
level and 20% for the high level to 15% and 30%, respectively. The levels for the
distance the cloud is shifted, δ_L, are also changed from 3σ and 4σ to 5σ and 10σ,
respectively. Because the number of clouds is generally not a significant factor
contributing to the performance of these procedures, it is set to a constant value of one for
all scenarios. The fourth factor is now the number of regressor variables out of k that
have outlying values for the planted outliers. The low setting is one and the high setting
is all k variables (2 or 6). The Mahalanobis distance has no power in virtually all
scenarios; therefore, its performance is not included with the results in any further
studies.
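
The factor settings just described define a two-level factorial in four factors, which can be enumerated as in the sketch below. The dictionary keys and variable names are illustrative assumptions; the factor levels are those stated above.

```python
from itertools import product

dimensions = (2, 6)              # number of regressors k
densities = (0.15, 0.30)         # total outlier density
magnitudes = (5, 10)             # cloud shift delta_L, in standard deviations
outlying_settings = ("one", "all")

design = []
for k, density, delta_l, setting in product(
        dimensions, densities, magnitudes, outlying_settings):
    design.append({
        "k": k,
        "density": density,
        "delta_l": delta_l,
        # High setting means all k regressors are outlying (2 or 6).
        "n_outlying": 1 if setting == "one" else k,
        "clouds": 1,             # held constant in this study
    })
```

Each of the 16 entries corresponds to one scenario, to be simulated with the stated number of replicates.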

The most interesting result from this study is the breakdown of the procedures at
30% density shown in the shaded high-dimension scenarios of Table 4.4. The false alarm
rates are abnormally large when the values are extreme in all k variables for these runs
and also the shaded run in low dimension. The SM&R clustering procedure performance
is consistent with previous findings: 100% detection capability if outlying in all regressor
variables and a significant loss of power if outlying only in a single variable. Similarly,
the  M-estimates  of  covariance  have  detection  power  limited  exclusively  to  low  dimension 
and  low-density  scenarios  independent  of  the  magnitude  of  the  outlying  distance.  The 
R&W  hybrid  estimator  is  typically  more  powerful  than  the  MCD  or  MVE.  Surprisingly, 
only  a  single  factor  (contamination  percentage)  is  significant  for  detection  capability  for 
R&W and none are significant for the MCD in this operating region.

Table 4.4. Design matrix with detection and false alarm probabilities (in parentheses)
for high-magnitude, high-density, high-leverage scenarios. Outliers are unusual in one or

4.4.2.4 Outlying Observations with Unusual Levels in 3 of 6 Variables

This  section  investigates  the  difference  in  detection  capability  and  false  alarm 
rates  for  the  procedures  when  the  outliers  have  an  intermediate  factor  setting  of  outlying 
in  3  of  6  regressor  variables.  The  motivation  is  the  discrepancy  in  performance  when  the 
outliers are unusual in all 6 variables versus outlying in only 1 of 6. There are only three
factors to consider in this study because the dimension is set at k = 6 and the number of
outlying variables is set at 3. The results are shown for the full factorial in three factors
in Table 4.5.

From  Table  4.5,  the  overall  detection  probabilities  are  similar  to  those  seen  in  the 
runs  with  all  6  variables  outlying.  However,  the  false  alarm  averages  in  the  high-density 
scenarios  are  much  lower  compared  to  the  rates  when  outlying  in  all  6  variables.  The 
exception  to  this  is  the  SM&R  false  alarm  rates  near  20%  for  many  of  the  high-density 
scenarios.  Again,  the  R&W  hybrid  algorithm  slightly  outperforms  the  MCD  and  MVE  in 
most cases. The MVE is vulnerable in the high outlier density scenarios for the 4σ and 5σ
cases, but is competitive with the other combinatorial procedures by 6σ and beyond. The
last scenario in Table 4.5 again shows the vulnerability of the MVE and the breakdown of
the otherwise excellent performing Hadi forward search.

Table 4.5. Design matrix with detection and false alarm probabilities
(in parentheses) for clouds that are remote in 3 of the 6 regressor variables.

4.4.2.5  Outlying  Observations  Without  Unusual  Response  Values 

This  study  evaluates  the  performance  when  the  response  value  for  the  outlying 
observations  is  not  a  Y-space  outlier.  The  purpose  of  this  study  is  twofold.  First, 
interesting  results  can  occur  when  the  signs  of  regressor  variables  are  changed,  as  we  do 
here,  and  second,  to  investigate  the  effect  on  the  SM&R  algorithm  that  has  performed 
well with unusual predicted response values. Recall that the regressor variables for the
clean observations are generated from a N(7.5, 4^2) distribution. In the studies to this
point, an observation in an outlying cloud at, for example, δ_L = 4σ_x in two regressor
variables would be placed at x1 = x2 = 7.5 + 4(4) + ε_ij ≈ 23.5, where ε_ij is distributed
Uniform(0, 0.25). The expected response value would be approximately 5*23.5 +
5*23.5 = 235. The expected response value for a clean observation is significantly lower,
5*7.5 + 5*7.5 = 75. For the scenarios in this experiment, the outlying cloud is placed
approximately 4σ_x above the mean, or 23.5, for x1 and 4σ_x below the mean, or -8.5, for x2.
The expected response for the outliers, 5*23.5 + 5*(-8.5) = 75, is now the same as that
for the clean observations. The scenarios selected for the study in Table 4.6 are random;
however, the results are consistent independent of the factor settings.
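
The arithmetic in this paragraph can be checked with a few lines of code. The sketch ignores the small Uniform(0, 0.25) jitter and the error term, and the names used are illustrative.

```python
BETA = 5.0      # coefficient on each regressor (intercept is 0)
MU_X = 7.5      # mean of the clean regressors
SIGMA_X = 4.0   # standard deviation of the clean regressors


def expected_response(x_values, beta=BETA):
    """Expected response on the regression surface for one observation."""
    return beta * sum(x_values)


shift = 4 * SIGMA_X  # delta_L = 4 standard deviations

clean = expected_response([MU_X, MU_X])
same_sign = expected_response([MU_X + shift, MU_X + shift])
opposite_sign = expected_response([MU_X + shift, MU_X - shift])
```

With the cloud shifted in opposite directions for the two regressors, the expected response of the outliers coincides with that of the clean observations, which is what removes the Y-space signal.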

The  results  indicate  that  SM&R  has  no  power  to  detect  outliers  in  X-space  if  the 
response variable is not unusual and the least squares residuals are driven essentially to
zero by the high-leverage points. The 20% false alarm rate for this method is consistent
with the published value for the null behavior. The other methods perform slightly below
their counterpart runs when all variables have the same sign.

Table 4.6. Design matrix with detection and false alarm probabilities

* This scenario has the outliers at x1 = 3σ, x2 = 3σ and x3 = -6σ for the first cloud and
x1 = -3σ, x2 = -3σ and x3 = 6σ for the second cloud.


4.4.2.6  Multiple  Point  Clouds  that  are  in  Close  Proximity 

These  miscellaneous  scenarios  test  the  ability  to  identify  outlying  multiple  point 
clouds positioned next to each other in X-space. In Sections 4.4.2.1 through 4.4.2.4, if
there are two multiple point clouds, one is placed at +δ_L σ_x and the other at -δ_L σ_x. The
location of the point clouds for this study is specified in the outlying magnitude column
of Table 4.7. These scenarios have been cited as challenging in the literature because the
outlying clouds can mask the other clouds from detection.

The  overall  results  are  mixed.  The  SM&R  detection  capability  is  consistent  with 
earlier  results:  excellent  detection  capability  if  outlying  in  3  of  6  or  all  6  variables,  poor 
detection capability if outlying in a single variable, and high false alarm rates. The
shaded scenarios highlight that R&W outperforms or is competitive with the MVE and
MCD. Note, in particular, the uncharacteristically high false alarm rate for the MCD
procedure in these three shaded scenarios.

Table 4.7. Design matrix with detection and false alarm probabilities
(in parentheses) for multiple point clouds located in close proximity.

4.4.3 Summary of Performance for Measures of Leverage

This  section  summarizes  the  key  findings  from  the  comparative  evaluation  for 
each  technique.  There  was  no  need  to  test  the  Mahalanobis  distance  (hat  diagonal)  after 
the  first  few  experiments  because  it  has  no  power  to  detect  multiple  outliers  in  X-space 
except  under  some  specific  conditions. 

M-estimates of covariance. The usefulness of the M-estimates of covariance
distances to detect multiple outliers in X-space is limited to low-dimension, low-density
scenarios. In high-dimension, the outlying leverage distance, δ_L, can be increased
without bound; yet, the M-estimates of covariance distances for the planted outliers are
still not unusual compared to the inlier distances. False alarm probabilities are always
below the nominal 5% rate unless in high-dimension, high-density. Overall, this method
is only slightly preferable to the Mahalanobis distance and it is inferior to the other
methods to identify outliers in X-space.

Hadi.  The  Hadi  forward  selection  algorithm  is  shown  to  have  decent  detection 
capability  with  very  low  false  alarm  probabilities.  Although  often  outperformed  by  other 
procedures,  significant  improvement  in  detection  capability  is  possible  by  lowering  the 
cutoff  values.  The  procedure  does  not  perform  well  in  high-dimension,  high-density. 
Further  experimentation  shows  little  improvement  in  high-dimension,  high-density 
scenarios  if  we  modify  the  cutoff  values. 

SM&R.  The  clustering  algorithm  of  the  least  squares  standardized  predicted  and 
residual  values  is  most  effective  when  the  predicted  values  are  unusual  with  respect  to  the 
response values for the clean observations. The procedure has higher detection
probabilities if the number of unusual regressors is large. The usefulness of this
procedure  as  an  input  to  the  measure  of  leverage  is  limited  by  the  large  false  alarm  rates 
in  many  scenarios  and  the  general  requirement  for  unusual  predicted  response  values. 

MCD.  The  robust  distances  from  the  MCD  estimates  are  a  dependable  measure  of 
leverage  and  useful  to  detect  multiple  outliers.  The  simulated  cutoff  values  for  robust 
distances  from  the  genetic  algorithm  in  S-Plus  allow  reasonable  detection  capability  and 
limit  the  impact  from  false  alarms.  The  MCD  and  MVE  have  similar  performance.  The 
MCD occasionally will significantly outperform the MVE, especially if k = 6 and the
number of outlying regressors is 3 or more. There are several high-dimension,
high-density scenarios where both the MVE and MCD fail to detect the outliers highlighted
throughout Section 4.4.2. In these instances, the MCD has a much greater false alarm
probability  than  the  MVE  or  any  other  procedure.  These  are  the  only  instances  when  the 
MCD  procedure  exceeds  the  nominal  5%  false  alarm  rate. 

MVE.  Robust  distances  based  on  the  MVE  estimate  are  shown  to  reliably  detect 
high-leverage  observations  in  most  scenarios.  Overall,  the  MVE  from  the  genetic 
algorithm  is  competitive  with  the  other  combinatorial  estimators  (MCD  and  R&W),  but 
can  have  significantly  lower  detection  capability  in  high-dimension,  high-density 
scenarios.  The  use  of  chi-square  critical  values  for  the  robust  distances  has  a  consistently 
high false alarm rate. Therefore, a moderate decrease in detection capability from the
simulated  critical  values  is  a  worthwhile  tradeoff  to  control  false  alarms. 

Rocke  and  Woodruff.  This  hybrid  procedure  consistently  performed  as  well  as  or 
better than the MVE and MCD in both our studies and those of the authors. The one
exception is that the MCD has slightly better detection probabilities in low-dimension,
high-density. R&W has superior performance to the MVE and MCD in the challenging
high-dimension, high-density scenarios. In fact, there are only two scenarios in Section
4.4.2.3 in which the procedure fails to detect the outliers and significantly swamps clean
observations.  The  results  suggest  that  this  is  the  preferred  procedure  to  identify  outliers 
in  X-space.  Of  course,  if  the  outlying  observations  are  known  to  have  unusual  response 
values,  the  SM&R  algorithm  is  recommended. 

The  R&W  procedure  is  the  most  versatile  and  often  the  best  performer  in  our 
studies  and  we  suggest  its  use  as  a  measure  of  leverage.  Rocke  and  Woodruff  (1997)  cite 
instances,  most  notably  in  high-dimension,  when  the  MVE  and  MCD  procedures  alone 
are likely to fail where theirs will not. They also note that their procedure is much more
computationally efficient for high-dimension and large sample size compared to the
MCD and especially the MVE. Our experience has shown reasonable computation times
for p < 10, n < 100 on a modest PC (PII-350, 96 MB RAM) when implemented through a
dynamically  linked  library  (DLL)  in  S-Plus  4.5.  However,  the  computation  times  for  the 
MVE,  MCD  or  M-estimates  of  covariance  are  significantly  less  than  those  of  R&W. 

4.5 Compound Estimators with R&W Robust Distances as the π-weight Component

The  simulation  results  in  the  previous  section  indicate  that  the  M-estimates  of 
covariance  used  in  the  S&M  compound  estimator  and  the  robust  distances  using  the 
MVE estimator with cutoff values used in C&H are not the strongest performing
techniques. The best performing alternative is the R&W robust distances. The R&W
robust distances are especially powerful in high-dimension, high-density scenarios. We
now evaluate the utility of using the R&W robust distances as the component in the
π-weights for the S&M and C&H estimators.

Example 4.1 Revisited. Although the exact factor settings in Example 4.1 are not
run in Section 4.4, the results from Table 4.6 with n = 60, k = 6, δ_L = 5σ_x and outlying in
3 (rather than 2 as in our example) of the 6 variables provide an accurate indication of
performance. The M-estimates of covariance have virtually no power to detect the
leverage points and the MVE robust distances are only about 60% effective. However,
the R&W robust distances detect the outliers approximately 95% of the time. The
modified S&M estimator replaces the M-estimates of covariance distances with the R&W
distances. The modified C&H estimator replaces the MVE robust distances with the
R&W robust distances. The C&H π-weights also use a chi-square cutoff value; we
replace it with the R&W cutoff value supplied by the algorithm.

Unfortunately, the parameter estimates and residuals for both modified estimators
have changed little from their original values with this leverage modification. On a
positive note, many of the final weights for the outliers have changed to a value between
0.0 and 0.5 as opposed to 0.99 in the original versions. Also, the π-weights are now
unusual for the 12 outlying observations.

Further experimentation with modified S&M and C&H techniques indicates there
is no real advantage to the improved π-weights in virtually all high-leverage scenarios.
For both the original and modified versions of these two compound estimators, we
observe that in the high-leverage, high-dimension scenarios the final parameter estimates
do not change much from the initial (S or LTS) values. Therefore, if the initial estimate
is improved, then the compound estimators may have a better chance to accommodate
these high-leverage outliers.

4.6  A  Proposal  for  a  New  Initial  Estimator 

The  high-breakdown  initial  estimators  typically  used  in  a  compound  or  GM 
technique  do  not  have  the  bounded-influence  property  and  thus  are  known  to  have 
difficulty  (i.e.  poor  parameter  estimates)  when  there  are  high-leverage  regression  outliers. 
We  consider  a  high-breakdown  “rejection  plus”  alternative  as  an  initial  estimator  that  first 
locates  and  then  eliminates  the  three  classes  of  outliers  from  the  data  set  followed  by 
parameter  estimation  from  a  least  squares  fit  on  the  reduced  data  set.  It  is  important  to 
clarify that we use the full data set for sequential stages of the compound estimation
scheme. 

For  the  proposed  initial  estimator,  high-leverage  outliers  are  first  eliminated, 
without  regard  to  how  well  they  fit  the  regression  surface,  if  the  R&W  robust  distance 
exceeds  the  algorithm’s  internally  calculated  cutoff  value.  The  remaining  observations 
should  all  be  on  the  interior  of  X-space.  From  Chapter  3  and  Simpson  and  Montgomery 
(1998b),  an  excellent  high-breakdown,  high-efficiency  estimator  for  low-leverage  outliers 
is  the  MM  estimator  (Yohai,  1987  and  Yohai  et  al.,  1991).  The  MM  estimator  has  three 
stages. The initial estimate is a high-breakdown estimate using an S-estimate. The
second stage computes an M-estimate of the errors’ scale from the initial S-estimate
residuals. The last step is an M-estimate of the regression parameters with a
redescending ψ-function. If the absolute value of the standardized residuals from an
MM estimate on the remaining interior X-space data exceeds a simulated cutoff value,
then these observations are also removed. The simulated cutoff value is 1.91 for both
n = 60, k = 6 and n = 40, k = 2 based on the 95th percentile of the residuals (absolute
values) from 1000 replications of uncontaminated data. In practice, it is probably
reasonable to use a rule of thumb of 2.0 to avoid the added complexity. The parameter
estimates  for  the  proposed  initial  estimator  come  from  a  least  squares  fit  on  the  remaining 
observations  after  the  two  high-breakdown  filters  remove  the  leverage  and  residual 
outliers.  Therefore,  the  steps  of  the  proposed  initial  estimator  are  1)  remove  high- 
leverage  observations  if  the  R&W  robust  distance  exceeds  the  algorithm’s  internally 
calculated  cutoff  value,  2)  from  the  remaining  observations,  remove  the  residual  outliers 
if  the  MM  residual  exceeds  the  simulated  cutoff  value  and  3)  obtain  parameter  estimates 
with  an  OLS  fit  on  the  remaining  data. 
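The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the S-Plus implementation used in this research: the leverage filter and the robust fit are passed in by the caller (`leverage_flag` and `robust_fit` are hypothetical names), standing in for the R&W robust distances and the MM estimator, which are not reimplemented here.

```python
import numpy as np

def with_intercept(X):
    """Prepend a column of ones to the regressor matrix."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(len(X)), X])

def rejection_plus_ols(X, y, leverage_flag, robust_fit, cutoff=2.0):
    """Sketch of the three-step rejection-plus initial estimate.

    leverage_flag: boolean array, True where a leverage filter (the R&W
        robust distance in the text) exceeds its cutoff. Supplied by caller.
    robust_fit: callable (X, y) -> (coef, scale); an assumed stand-in for
        the MM estimator and its residual scale. Supplied by caller.
    cutoff: threshold on |standardized residual| (rule of thumb 2.0).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Step 1: drop high-leverage observations regardless of how well they fit.
    keep = ~np.asarray(leverage_flag, dtype=bool)

    # Step 2: robust fit on the interior points, then drop residual outliers.
    coef_robust, scale = robust_fit(X[keep], y[keep])
    std_resid = (y - with_intercept(X) @ coef_robust) / scale
    keep &= np.abs(std_resid) <= cutoff

    # Step 3: OLS on the observations that survive both filters.
    coef, *_ = np.linalg.lstsq(with_intercept(X[keep]), y[keep], rcond=None)
    return coef, keep
```

The returned mask `keep` shows how many observations entered the final OLS fit, which is the quantity that determines the efficiency of the initial estimate.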

This type of estimator is termed a “rejection-plus” estimator because it eliminates
outlying observations from the data and uses an optimal estimator (OLS) on the supposedly
clean remaining data. The “rejection plus” regression estimator logic has been suggested
in the literature with different robust regression estimators and multiple outlier detection
algorithms (e.g., Rousseeuw, 1984, Simonoff, 1991, Hadi and Simonoff, 1993, and
Wilcox, 1997). He and Portnoy (1992) point out that the estimate of the standard error
may not converge to the correct value as n gets large for these procedures. The “rejection
plus” scheme has not been proposed as an initial estimator for multi-staged techniques.
There is less of a concern about convergence because the full data set is used in the
remaining stages of the GM or compound estimate. This initial estimator is appealing
because the simulation results from Chapter 3 and Section 4.4 indicate both R&W and
MM are powerful across a comprehensive set of scenarios, yet do not have a tendency to
false alarm. The false alarms in this scheme impact the efficiency of the initial estimator
because each false alarm results in the removal of a clean observation from the final
subset used in the OLS estimate of the parameters. Another advantage for this type of
initial estimator is that it is likely to be more efficient than the other high-breakdown
estimators that use, in some cases, only half of the observations to estimate parameters.

4.6.1  Initial  Estimator  Performance  Studies 

The proposed initial estimator is tested against other high-breakdown initial
estimators in the literature. We consider the LMS, LTS (set to achieve 30% breakdown),
and S-estimators in addition to three variants of the proposed estimator. The first variant
of the proposed initial estimator, P1, is the one previously described: an R&W filter of
high-leverage points followed by an MM filter of residual outliers and then an OLS fit to
the remaining observations. The second proposal is the P2 estimator with parameter
estimates from the MM fit after the R&W filter rather than the OLS parameter estimates
used in P1. The P3 estimator is the same as P2 except that an S-estimator is used in place
of the MM estimator.

The performance study evaluates the effect on the initial estimators from multiple
high-leverage and residual outliers. The factors are dimension (number of regressors),
outlier density, leverage (outlying magnitude in X-space, $\delta_L$), residual magnitude
(outlying distance off the regression surface, $\delta_R$), and the number of regressor variables
out of k that are unusual in X-space. The response, efficiency ratio, is
$\mathrm{MSE}_{clean}/\mathrm{MSE}_{estimator}$, where $\mathrm{MSE}_{clean}$ is the MSE for the known clean observations from an
OLS fit with only the known clean observations and $\mathrm{MSE}_{estimator}$ is the MSE for the known
clean observations from the fit using the selected estimator on the entire data set. Table
4.8a shows the $2^{5-1}$ design matrix and average efficiency ratios from 50 replicates in
S-Plus 4.5. Note that additional replication does not significantly change the values in
Table 4.8a and does not change the key findings.
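The efficiency ratio defined above can be computed as in the following sketch. It assumes coefficient vectors with the intercept first and uses the mean squared residual over the clean cases; because the same divisor appears in numerator and denominator, the choice of divisor does not affect the ratio. The function name and arguments are illustrative only.

```python
import numpy as np

def efficiency_ratio(X, y, clean, coef_estimator):
    """MSE_clean / MSE_estimator, both evaluated on the known clean cases.

    clean: boolean mask of the known clean observations.
    coef_estimator: coefficients (intercept first) from the estimator under
        study, fitted to the entire (contaminated) data set.
    """
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    clean = np.asarray(clean, dtype=bool)

    # Reference fit: OLS using only the known clean observations.
    coef_clean, *_ = np.linalg.lstsq(A[clean], y[clean], rcond=None)

    # Mean squared residual over the clean cases for each fit.
    mse_clean = np.mean((y[clean] - A[clean] @ coef_clean) ** 2)
    mse_estimator = np.mean((y[clean] - A[clean] @ np.asarray(coef_estimator)) ** 2)
    return mse_clean / mse_estimator
```

Because OLS on the clean cases minimizes the squared residuals of those cases, the ratio cannot exceed 1; values near 1 indicate that the estimator fits the clean data almost as well as if the outliers had been known and removed.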

Table  4.8a.  Design  matrix  and  efficiency  ratios  for  common  initial  estimators. 


As expected, the ability of the common initial estimators (LMS, LTS and S) to fit
the clean data is significantly impacted in the high-leverage scenarios tested in Table 4.8.
The LTS and S estimators consistently outperform LMS. The proposed initial estimators
P1 and P2 have much better efficiency ratios than the existing procedures and P3. A
surprising result is that the S-estimator and P3 have similar results. Removing the
high-leverage observations apparently has little effect on the S-estimator’s performance in
these selected scenarios. From Table 4.8a, P1 and P2 are the preferred alternatives.

In all of the scenarios in Table 4.8a, the R&W estimator detected the planted
outliers because they were extreme in X-space. We now consider the performance of the
initial estimators when the leverage distance $\delta_L$ is not as great. The R&W estimator does
not necessarily detect and remove all of the planted observations unusual in X-space.
Table 4.8b includes not only the $2^{5-1}$ design and resulting efficiency ratios, but also (in
the last column) the proportion of outlying observations that R&W removes. Although
there is less of a discrepancy in efficiency ratios between the first two proposed initial
estimators and the alternatives, P1 and P2 again have the best results. P1 slightly
outperforms P2 in these scenarios.

Table 4.8b. Design matrix and efficiency ratios for common initial estimators
when R&W does not necessarily detect the leverage points. The last column is the
proportion of planted outliers removed by the R&W filter.

A      B     C           D           E
n, k   dens  $\delta_L$  $\delta_R$  out  OLS    LMS     LTS     S       P1        P2      P3     % RW
40,2   10%   (a)         5           2    0.640  0.825   0.875   0.904   0.968     0.976   0.877  0.50
60,6   10%   (a)         5           1    0.606  0.655   0.787   0.782   0.921     0.949   0.734  0.14
40,2   20%   (a)         5           1    0.413  0.840   0.919   0.921   0.869     0.705   0.911  0.07
60,6   20%   (a)         5           2    0.366  0.700   0.841   0.864   0.748     0.682   0.804  0.17
40,2   10%   (b)         5           1    0.681  0.763   0.851   0.854   0.956     0.975   0.807  0.92
60,6   10%   (b)         5           2    0.599  0.647   0.752   0.763   0.937     0.966   0.715  0.99
40,2   20%   (b)         5           2    0.436  0.661   0.685   0.683   0.956     0.969   0.795  0.99
60,6   20%   (b)         5           1    0.376  0.564   0.740   0.715   0.631     0.643   0.728  0.17
40,2   10%   (a)         10          1    0.376  0.791   0.871   0.874   0.961     0.982   0.843  0.16
60,6   10%   (a)         10          2    0.294  0.679   0.781   0.794   0.940     0.974   0.751  0.24
40,2   20%   (a)         10          2    0.159  0.827   0.922   0.913   0.966     0.987   0.859  0.32
60,6   20%   (a)         10          1    0.141  0.656   0.829   0.838   0.938     0.972   0.816  0.10
40,2   10%   (b)         10          2    0.347  0.827   0.884   0.904   0.966     0.985   0.858  1.00
60,6   10%   (b)         10          1    0.310  0.655   0.785   0.783   0.931     0.966   0.732  0.36
40,2   20%   (b)         10          1    0.247  0.787   0.885   0.884   0.957     0.972   0.825  0.69
60,6   20%   (b)         10          2    0.149  0.713   0.857   0.866   0.932     0.972   0.716  0.80

Average efficiency                        0.384  0.724   0.829   0.8831  0.911     0.917   0.798
Significant effects                       A,B,D  A,C,CD  A,C,CD  A,C,CD  A,B,D,BD  B,D,BD  A,C

(a), (b): the two levels of $\delta_L$; their numeric values are illegible in the source.

4.7  Proposal  of  New  Compound  Estimators 

The results of Sections 4.4 and 4.6 suggest components of the S&M and C&H
compound estimators could be changed to increase the envelope of effective performance
in high-leverage, high-dimension scenarios. Section 4.4 clearly indicates superior
performance of the R&W robust distances over the M-estimates of covariance distances
and moderately better performance over the MVE robust distances. Section 4.6 indicates
that an improved high-breakdown initial estimator that can accommodate high-leverage
outliers is possible by using P1 or P2 rather than the existing LTS and S-estimators.
Therefore, the components of the proposed compound estimator CEP1 are a P1 initial
estimate, R&W robust distances as the measure of leverage, $\pi$-weights as the ratio of
R&W robust distance to the median robust distance, an LMS estimate of scale using the
residuals from the P1 initial estimate, a Tukey bi-weight $\psi$-function with tuning constant
4.685 for 95% efficiency, and a one-step convergence from IRLS. CEP1 is similar to the
S&M estimator except that the measure of leverage is R&W versus the M-estimates of
covariance distances and the initial estimate is P1 versus an S-estimate. Another
difference is the estimate of scale. S&M uses the S-estimate of scale in part because it is
readily available from the initial S-estimate. Rather than add even further computational
complexity to our procedure, we use the LMS estimate of scale defined as
$1.4826\,(1 + 5/(n - p))\,\mathrm{median}\,|e_{P1}|$, where $e_{P1}$ is the vector of residuals from the initial fit with the P1
estimator. The second proposed compound estimator, CEP2, is the same as CEP1 except
that the initial estimate is from P2.
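Two of the CEP1 components named above have simple closed forms: the LMS estimate of scale and the Tukey bi-weight $\psi$-function with tuning constant 4.685. The sketch below shows both; the function names are illustrative, and the bi-weight is written in its standard textbook form, which is assumed to match the one used here.

```python
import numpy as np

def lms_scale(residuals, p):
    """LMS scale: 1.4826 * (1 + 5/(n - p)) * median|e| for initial-fit residuals e."""
    e = np.asarray(residuals, dtype=float)
    n = e.size
    return 1.4826 * (1.0 + 5.0 / (n - p)) * np.median(np.abs(e))

def tukey_biweight_psi(u, c=4.685):
    """Tukey bi-weight psi: u * (1 - (u/c)^2)^2 for |u| <= c, and 0 otherwise."""
    u = np.asarray(u, dtype=float)
    inside = np.abs(u) <= c
    return np.where(inside, u * (1.0 - (u / c) ** 2) ** 2, 0.0)
```

The $\psi$-function redescends to zero beyond the tuning constant, so an observation whose scaled residual exceeds 4.685 has no influence on the fit.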

We also consider the proposed compound estimators CEP3 and CEP4 that are the
S&M estimator with R&W leverage measures modified by using 3 or 4 iterations,
respectively, of IRLS to solve the normal equations. The motivation for these estimators
comes from the improvement in final weights in the example problem. Also, He and
Portnoy (1992) suggest that in practice a single step of a GM iteration scheme is often
insufficient. Pilot studies showed that the parameter estimates do not change
significantly after 2 iterations and that at least 3 are required to effectively accommodate
the high-leverage outliers across a variety of test scenarios. Simpson and Chang (1997)
demonstrate that several iterations still maintain the same first order large sample
properties as the single iteration version of the compound estimator.
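The difference between a one-step estimate and the 3- or 4-step variants can be illustrated with a fixed-step IRLS loop. The sketch below assumes a Schweppe-type GM form, in which each residual is scaled by the product of the scale estimate and the observation's $\pi$-weight, and it holds the scale fixed across steps. Both choices are assumptions made for illustration; the $\pi$-weights are supplied by the caller and assumed to lie in (0, 1].

```python
import numpy as np

def biweight_weight(u, c=4.685):
    """IRLS weight psi(u)/u for the Tukey bi-weight."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= c, (1.0 - (u / c) ** 2) ** 2, 0.0)

def gm_irls(A, y, coef_init, scale, pi_weights, n_steps=1, c=4.685):
    """Run a fixed number of IRLS steps starting from an initial estimate.

    A: design matrix including the intercept column.
    pi_weights: leverage weights in (0, 1]; 1 means no downweighting.
    n_steps: 1 gives a one-step estimate; 3 or 4 mimic the CEP3/CEP4 idea.
    """
    A = np.asarray(A, dtype=float)
    y = np.asarray(y, dtype=float)
    pi_weights = np.asarray(pi_weights, dtype=float)
    coef = np.asarray(coef_init, dtype=float)
    for _ in range(n_steps):
        resid = y - A @ coef
        # Schweppe-type scaling: residuals are judged relative to scale * pi_i.
        w = biweight_weight(resid / (scale * pi_weights), c)
        sw = np.sqrt(w)
        # Weighted least squares step implemented by row scaling.
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef
```

With `n_steps=1` the result can stay close to a poor initial estimate, which is the behavior that motivates both the improved initial estimator and the additional iterations.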

Example 4.1 provides an initial performance indication of the four proposed
compound estimators. All four estimators effectively identify and downweight the 12
planted outliers in this example. Their parameter estimates are similar to each other.
These estimates are much closer to both the generating $\beta$ vector and the estimates from
an OLS fit on the known 48 clean observations. Additionally, the initial estimate for CEP1
and CEP2 is efficient because the correct 12 outlying observations are rejected and 44 out
of the 48 clean observations are used in the OLS computation of the initial parameter
estimates. Table 4.9 summarizes the performance of the estimators tested for Example
4.1.


Table 4.9. Estimator performance for Example 4.1. Mod S&M and Mod C&H
are the modified versions using the R&W robust distances for the measure of leverage
with all other components the same. Unusual residual and $\pi$-weights indicate if those
measures are significantly different for the 12 outlying cases. Swamped cases refers to
the number of cases out of the clean 48 that have a standardized residual value exceeding
2.5 in absolute value.

Estimator   MSE for       Unusual     Unusual          Swamped
            clean cases   residual?   $\pi$-weights?   cases
[missing]   2.716         No          NA               7
S&M         3.063         No          No               8
C&H         3.009         No          No               7
Mod S&M     [missing]     No          [missing]        8
Mod C&H     [missing]     No          [missing]        7
[missing]   [missing]     No          NA               8
[missing]   [missing]     No          NA               8
CEP1        0.891         [missing]   [missing]        [missing]
CEP2        0.890         Yes         Yes              [missing]
CEP3        0.976         Yes         Yes              0
CEP4        0.902         Yes         Yes              0

4.8  Performance  of  the  Proposed  Compound  Estimators 

This section evaluates how well the proposed estimators perform beyond the
single example discussed. Example 4.1 clearly demonstrates the ability of the four
proposed estimators to accommodate the outliers in this high-leverage, high-dimension
example while the C&H and S&M estimators do not. The example does not discriminate
the performance between these four proposed procedures. This section tests the
compound estimators through Monte Carlo simulations to quantify the ability to
accommodate outliers.

4.8.1  Proposed  Estimators’  Area  of  Coverage 

The scenario for Example 4.1 that develops the need for a new compound
estimator uses n = 60, k = 6 with 20% outliers at a leverage magnitude of $5\sigma_x$ in 2 of the
6 regressor variables and a residual magnitude of $10\sigma_e$. The proposed estimators effectively
downweight the observations while the other compound estimators did not. The Monte
Carlo simulation studies in Chapter 3 and Section 4.4 indicate that the leverage and
residual magnitudes (and their two-factor interaction) are the most important factors
influencing the performance of the tested procedures. Figure 4.1 provides a leverage and
residual magnitude sensitivity analysis for the compound estimators. The measure of
performance is whether or not the estimator identifies and downweights the planted
outliers given the level of leverage (between $\delta_L = 0.0$ and $10.0\,\sigma_x$) and residual (between
$\delta_R = 3.0$ and $10.0\,\sigma_e$) magnitude. Fixed are the number of observations at 60, the number
of regressors at 6, the percentage of outliers in the single multiple point cloud at 20%, and
the number of regressors out of 6 with high-leverage points at 2.

A method is successful in a single replicate if the average standardized residual
value for the 12 outliers is greater than or equal to 2.5. There are 50 replications and if
the method is successful in at least 70% of the replicates, then it is deemed successful for
that combination of leverage and residual magnitude. The actual percentage of correct
outlier identification is shown in Appendix B. An estimator successfully
accommodates outliers at all combinations of leverage and residual magnitude above the
line shown in Figure 4.1. The C&H and S&M estimators do not have any power beyond
$\delta_L = 4\sigma_x$. At $\delta_R = 15\sigma_e$ the S&M procedure will downweight the 12 outliers because the
initial S-estimate has detected the outliers and the parameter estimates do not change
much in the remaining stages. The $\pi$-weights are still not unusual. If $\delta_L = 10\sigma_x$, the
S&M estimator accommodates the 12 outliers when $\delta_R > 20\sigma_e$ for the same reason.
Figure 4.1 displays how the proposed compound estimators increase the envelope of
performance, particularly in $\delta_L$, over the other estimators. CEP1 and CEP2 are preferred
over S&M, C&H, CEP3 or CEP4. Although the levels of n, k, density and number of
clouds are fixed, the simulation results from both Chapter 3 and Section 4.4 suggest
significant increases in the envelope of performance are possible across a variety of
factor level combinations.

Figure 4.1. Approximate area of coverage for the 6 compound estimators. The
data set consists of a single outlying multiple point cloud with factor settings n = 60, k =
6, outlier density = 20% and outlying in 2 of 6 regressor variables. The X-axis measures
the leverage, $\delta_L$, in standard deviation units and the Y-axis measures the cloud’s outlying
distance in residual, $\delta_R$, in standard deviation units. The area above the line for each
technique indicates where the estimator is at least 70% effective in identifying the planted
outliers. Note that there is no coverage until at least $\delta_R = 15\sigma_e$ for lines S&M and C&H
above $\delta_L = 4$ standard deviation units.
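The coverage rule behind Figure 4.1 (success in a replicate when the planted outliers average at least 2.5 in standardized residual, and coverage when at least 70% of replicates succeed) can be written compactly. The sketch below assumes the average is taken over absolute standardized residuals, which the text does not state explicitly; the function and argument names are illustrative.

```python
import numpy as np

def covered(std_resid, outlier_idx, resid_cutoff=2.5, min_rate=0.70):
    """Decide whether an estimator covers one (leverage, residual) setting.

    std_resid: array of shape (replicates, n) of standardized residuals.
    outlier_idx: indices of the planted outliers.
    Returns (success_rate, covered_flag).
    """
    std_resid = np.asarray(std_resid, dtype=float)
    # Average absolute standardized residual of the planted outliers,
    # one value per replicate (the absolute value is an assumption).
    avg = np.abs(std_resid[:, outlier_idx]).mean(axis=1)
    rate = np.mean(avg >= resid_cutoff)
    return rate, bool(rate >= min_rate)
```

Evaluating this function over a grid of leverage and residual magnitudes would trace out a boundary of the kind drawn for each estimator in the figure.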


4.8.2  Performance  in  Published  Scenarios 

Simpson and Montgomery (1998b) conduct a performance study using Monte
Carlo simulation in 24 outlier scenarios to evaluate several common and proposed robust
regression procedures. The study considers four factors: 1) number of regressors and
observations with levels k = 2, n = 16; k = 6, n = 40; and k = 10, n = 80; 2) outlier density with
levels 10% and 20%; 3) outlier leverage; and 4) the presence or absence of
approximately 20% high-leverage observations. The regressor variables are placed in a
2-level factorial designed experiment arrangement with levels of ±1. There are also
approximately 20% axial design points with levels of 0 for all but one regressor variable
whose level is $\pm\sqrt{k}$. The high-leverage observations replace $\sqrt{k}$ with a value between 5
and 14. Note that the design matrix does not change in the simulation replicates within
an outlier scenario. The i-th response value is generated by $y_i = \beta' x_i + \varepsilon_i$, where $\beta$ is the
vector of known coefficients for the simulation, $x_i$ is the i-th row of the design matrix, and $\varepsilon_i$
is NID(0, 1) for the clean observations and a large constant for the planted outliers.
The measure of performance is the mean square error of estimation defined as
$\mathrm{MSEE} = (\hat{\beta} - \beta)'(\hat{\beta} - \beta)$, where $\hat{\beta}$ is the vector of parameter estimates from the robust
technique and $\beta$ is the vector of known model coefficients.
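The MSEE defined above, and its average over simulation replicates, reduce to a squared Euclidean distance between the estimated and known coefficient vectors. A minimal sketch follows, with illustrative function names.

```python
import numpy as np

def msee(beta_hat, beta):
    """Mean square error of estimation: (beta_hat - beta)'(beta_hat - beta)."""
    d = np.asarray(beta_hat, dtype=float) - np.asarray(beta, dtype=float)
    return float(d @ d)

def amsee(beta_hats, beta):
    """Average MSEE over replicates; beta_hats has one row per replicate."""
    d = np.asarray(beta_hats, dtype=float) - np.asarray(beta, dtype=float)
    return float(np.mean(np.sum(d ** 2, axis=1)))
```

Because the known coefficients are fixed within a scenario, differences in the averaged value between estimators reflect estimation error only.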

The scenario descriptions and simulation results for 100 replicates of the 24
scenarios are shown in Table 4.10. The average MSEE (AMSEE) is the average of the
MSEE for the 100 replicates. The second to last column is the percent of the total
observations included in the initial estimate for P1 and the last column is the percent of
total observations that P1 should ideally have used in the initial estimate. P1 is an
efficient estimator in these scenarios because the ratio of the observed to the expected
value is consistently above 95% for the 24 data sets. If P1 uses significantly fewer
observations than expected, then many false alarms have resulted. Conversely, if P1 uses
more than the expected observations, then it has included high-leverage observations or
interior residual outliers in its parameter estimates.

The design matrices for the simulations had to be slightly altered from Simpson
and Montgomery (1998b) by adding a realization of $N(0, 0.1^2)$ to every level of the k
regressors because the R&W technique and MM estimator have difficulty with singularity
when the levels are ±1. This modification does not change the overall results much from
Simpson and Montgomery (1998b). Also, for the 6 different design matrices, X, used in
the 24 outlier scenarios, all measures of leverage for the compound estimators (MVE,
M-estimates of covariance, and R&W) correctly assign a large distance to all high-leverage
observations and do not assign a large distance to any low-leverage observation.
Therefore, the measure of leverage is not a discriminating factor affecting candidate
compound estimator performance.
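The jitter applied to the design matrices can be illustrated as follows. This sketch builds a small full two-level factorial with a few axial points and adds $N(0, 0.1^2)$ noise to every entry. The layout (full factorial, ordering and number of axial points) is a hypothetical choice made for illustration and is not the exact set of design matrices used in the study.

```python
import itertools
import numpy as np

def jittered_design(k, n_axial, sd=0.1, seed=0):
    """Two-level factorial (+/-1) plus axial points, with N(0, sd^2) jitter."""
    rng = np.random.default_rng(seed)
    # Full 2^k factorial at levels -1 and +1.
    factorial = np.array(list(itertools.product([-1.0, 1.0], repeat=k)))
    axial = np.zeros((n_axial, k))
    for i in range(n_axial):
        # One nonzero coordinate per axial point, alternating in sign.
        axial[i, (i // 2) % k] = np.sqrt(k) * (1.0 if i % 2 == 0 else -1.0)
    X = np.vstack([factorial, axial])
    # Jitter every level so that no column takes only a few tied values.
    return X + rng.normal(0.0, sd, size=X.shape)
```

After jittering, no two entries in a column are tied, which removes the degeneracy that exactly repeated ±1 levels can cause for robust location and scatter estimates.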

The results for the published estimators in Table 4.10 are consistent with those of
Simpson and Montgomery (1998b, page 1044). Of the new proposals, CEP1, CEP3 and
CEP4 perform similarly and are moderately better than CEP2. S&M still outperforms all
other estimators in these scenarios. CEP1, CEP3 and CEP4 are competitive with S&M
except when the 20% high-leverage points are present with another 20% outliers on the
interior of X-space (data sets 10-12). The proposed estimators outperform S&M in the
high-leverage outlier scenario of data set 17; otherwise the techniques have similar
performance. Overall, the proposed estimators are strong performers with only one area
of vulnerability (DS10). The performance of the stand-alone MM estimator in Table 4.9
should not be overlooked. It is vulnerable only for high-leverage outliers in low


Table 4.10. Average Mean Square Error of Estimation (AMSEE) for robust regression techniques using Simpson and
Montgomery (1998b) data sets. CEP1 is the proposed compound estimator with R&W measures of leverage and P1
initial estimates and CEP2 is the proposed compound estimator with R&W measures of leverage and P2 initial estimate. CEP3 [remainder of caption not recoverable]


[Table 4.10 body not recoverable: the rotated table was garbled in extraction. Legible column labels include DS, Description, LTS, MM, C&H, S&M and Exp % obs.]

dimension. For this reason, CEP1, which incorporates an MM estimator, is probably the
best alternative to protect against multiple outliers in high-dimension. Although CEP1 is
computationally complex, a data set of n = 100 observations with k = 10 regressor
variables requires approximately 20 seconds on a modest PC (2 seconds for S&M and 5
seconds for C&H). The procedure could be made much more computationally efficient if
it did not go through the S-Plus interface.

4.9  Summary 

This chapter develops new compound estimators that greatly expand the region of
effective performance in the presence of high-leverage outliers over existing procedures.
A comprehensive simulation study on common measures of leverage indicates that the
R&W procedure is the most robust across a variety of scenarios. The improved measure
of leverage alone does not significantly improve a compound estimator’s performance
with high-leverage outliers unless several more iterations are added to the IRLS solution
to the GM normal equations. The common high-breakdown initial estimators (LMS, LTS
and S) are vulnerable to the high-leverage outliers and provide inferior initial estimates.
Good initial estimates are essential because often the final estimates from a compound
estimator do not significantly change when they should. Also, the estimate of scale for a
compound estimator is based on the residuals from the initial estimate.

Our approach is to provide an initial estimate based only on observations that are
not high-leverage points or residual outliers. This provides an efficient estimator because
50% of the observations are not removed from the sample as they are in LMS and LTS.
The R&W and MM filters remove a variable percentage of the data. Studies show that
this scheme leaves approximately 95% of the clean observations in the initial estimate
independent of outlier density or dimension. Therefore, the proposed initial estimator is
efficient, high-breakdown and bounded-influence. The next stages of a compound
estimator can then smooth in the outliers based on the user’s downweighting philosophy
through the choice of $\psi$-function.

Simulation studies indicate our proposed estimator CEP1 is competitive with the
top performing robust regression techniques tested in published scenarios and preferred
in high-leverage, high-dimension scenarios. CEP1 uses the R&W robust distances for the
leverage measure, an initial estimate from OLS after the R&W and MM filters, and an
LMS estimate of scale based on the residuals from the initial estimate. Figure 4.1
provides an indication of the substantial improvement in performance this estimator has
in the presence of high-leverage outliers over other robust regression techniques.


Chapter  5 

Resampling  Methods  for  Variable  Selection  in 
Least  Squares  and  Robust  Regression 


5.1  Introduction 

An important aspect of the regression model building strategy is selecting the
appropriate subset from the candidate regressor variables. We consider the usual
multiple regression model, $y = X\beta + \varepsilon$, where $X$ is the $n \times p$ matrix of regressor variables,
$\beta$ the $p$ vector of parameters and $\varepsilon$ the random error assumed to be independent and
identically distributed (i.i.d.) with mean 0 and variance $\sigma^2 I$. The $\beta$ vector is partitioned
into an active variable set, $\beta_1$ of $p - q$ parameters, and an inactive set, $\beta_2$ of $q$ parameters, to
test the hypothesis that

$H_0: \beta_2 = 0$
$H_a: \beta_2 \neq 0$.

Failure to reject the null hypothesis suggests there is no evidence that any of the regressor
variables in set $\beta_2$ have an effect on the response value.

The goal of a variable selection procedure is to have the significant regressor
variables included in set $\beta_1$ with high probability, while simultaneously achieving a high
probability that the insignificant variables are contained in set $\beta_2$. The regression model
building strategy is an iterative process that involves selection of an active subset of the $p$
regressors followed by model diagnostics to assess the fit. The objective is to find the
best subset of the $p$ parameters to include in the model that leads to good prediction
capability yet minimizes the variance of prediction. The first objective would suggest
including all $p$ variables while the second suggests using as small a subset as possible
because the variance of prediction always increases as regressor variables are added.
Models with fewer variables are also preferred for simplicity in interpretation and ease of
future data collection.

There are numerous variable selection methods available to the analyst. The
simplest approach is to retain only the variables whose ratio of coefficient to standard
error is significant. This t-test approach is not reliable with increasing dimension,
particularly when dependencies between regressor variables exist. A common alternative
is the class of computer-intensive variable selection methods (e.g. forward, backward,
stepwise, and best subsets regression). The selection criteria are often based on F-tests
(F-to-enter and F-to-leave) or Mallows's (1973) $C_p$ criterion,

$$C_p = \frac{1}{\hat{\sigma}^2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 - n + 2p,$$

where $\hat{y}_i$ is the predicted value and $\hat{\sigma}^2$ is typically the
mean square error ($MS_E$) from the full model. Unfortunately, Miller (1990) demonstrates
that the F-tests and Mallows's $C_p$ criterion are poor for model selection, as are the $R^2$ and
adjusted $R^2$ measures. Breiman (1995) states that the preferred measure of performance
for variable selection in regression is some measure of prediction error.
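
As a concrete illustration of the $C_p$ computation, the following minimal sketch evaluates the criterion for a full model and for a submodel. The synthetic data, the seed, and the helper names `rss` and `mallows_cp` are assumptions made for this example only; they are not the data or code used in this study.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def mallows_cp(X_full, y, cols):
    """Mallows's Cp for the submodel that uses the columns `cols` of X_full."""
    n, p_full = X_full.shape
    sigma2_hat = rss(X_full, y) / (n - p_full)  # MSE of the full model
    p = len(cols)                               # parameters in the submodel
    return rss(X_full[:, cols], y) / sigma2_hat - n + 2 * p

# Synthetic data (assumed): intercept plus four regressors, three inactive.
rng = np.random.default_rng(0)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
beta_true = np.array([2.0, 0.0, 0.0, 4.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

cp_full = mallows_cp(X, y, [0, 1, 2, 3, 4])  # full model
cp_true = mallows_cp(X, y, [0, 3])           # intercept and the active regressor
```

When $\hat{\sigma}^2$ is the full-model $MS_E$, the full model always has $C_p = p$ exactly, so the criterion is informative only for comparing proper submodels.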

The resubstitution or apparent prediction error for a regression model is defined
by $\hat{\Delta}_{app} = n^{-1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. Note that this quantity differs from the usual $MS_E$ estimate
because $n$ is used in place of $n - p$. The final prediction error (FPE) criterion (Zhang,
1992) accounts for the number of regressors in the model and is computed by
$\hat{\sigma}^{-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda p$,
where $\lambda$ is the penalty constant for including extra variables. When $\lambda = 2$, FPE can be
shown to be equivalent to Mallows's $C_p$. It has been well documented that the FPE, $C_p$
and $\hat{\Delta}_{app}$ measures are highly biased (Miller, 1990; Zhang, 1992; Shao, 1993;
Breiman, 1995; Davison and Hinkley, 1997) and not recommended for variable selection.
Direct minimization of these measures leads to models that have too many significant
variables; that is, the dimension of $\beta_1$ is too large.

Several  authors  have  proposed  computationally  complex  resampling  methods  to 
address  the  shortcomings  of  the  usual  methods  for  variable  selection.  Common 
resampling  methods  are  cross-validation  and  bootstrapping.  We  describe  several  cross- 
validation  and  bootstrap  resampling  methods  to  calculate  prediction  error  in  Section  5.2. 
Each  of  these  methods  suggests  selecting  the  model  with  the  minimum  prediction  error 
among the competitors. Our approach is to relax the requirement for the absolute
minimum prediction error and select a model that has the fewest number of variables and
a low (not necessarily minimum) prediction error. This criterion is effective with both
cross-validation and bootstrap estimates of prediction error. It is our belief that in low
dimension  (p  <  10),  a  reasonable  strategy  is  to  look  at  a  screeplot  (scatterplot  of  the 
number  of  parameters  versus  prediction  error)  of  candidate  models  of  increasing 
dimension.  The  model  with  the  fewest  parameters  where  the  curve  levels  off  is  selected. 
For  example,  the  screeplot  in  Figure  5.1  suggests  that  although  the  7-parameter  model 
has  the  minimum  value  of  prediction  error,  little  improvement  is  gained  after  five 
parameters are included in the model. In Section 5.3 we describe a simulation
experiment that tests several resampling prediction error methods on a published data set
using  both  model  selection  criteria:  the  absolute  minimum  prediction  error  and  our 
recommended  strategy.  The  results  in  Section  5.4  indicate  that  the  absolute  minimum 
criterion  is  not  required  and  effective  model  selection  is  possible  with  the  proposed 
heuristic.  Section  5.5  extends  the  simulation  to  higher  dimension  and  also  evaluates 
performance  when  the  signal-to-noise  ratio  is  not  high.  Section  5.6  explores  the 
usefulness  of  the  various  resampling  model  selection  schemes  in  the  presence  of  outlying 
observations  using  a  robust  regression  estimator.  Recommendations  and  conclusions  are 
offered in Section 5.7.


Figure 5.1. Representative screeplot of aggregate prediction error (horizontal axis: number of parameters).

5.2 Resampling Measures of Prediction Error

The  two  classes  of  resampling  methods  currently  recommended  to  calculate  a 
measure  of  prediction  error  for  variable  selection  are  cross-validation  and  bootstrapping. 
Cross-validation  procedures  partition  the  data  into  two  disjoint  sets.  The  model  is  fit  with 
one  set  (the  training  set)  and  it  is  subsequently  used  to  predict  the  responses  for  the 
observations  in  the  second  set  (assessment  set).  Bootstrap  procedures  form  many 
samples  of  the  original  data  by  resampling  with  replacement.  Details  of  the  methods  and 
their  application  to  the  variable  selection  problem  in  regression  are  outlined  below. 

5.2.1  Cross-Validation  Procedures 

An intuitively appealing method to calculate a predicted response value is to use
the parameter estimates from the fit obtained by omitting the observation to be predicted.
This predicted response value is denoted by $\hat{y}_{(i)}$. Then
$\hat{\Delta}_{CV,1} = n^{-1} \sum_{i=1}^{n} (y_i - \hat{y}_{(i)})^2$ is
computed as the leave-one-out cross-validation estimate of average prediction error for a
model. Apart from the $n^{-1}$ term, this quantity is the predicted error sum of squares
(PRESS) statistic in least squares (Allen, 1971). The PRESS statistic can be calculated as

$$\text{PRESS} = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2,$$

where $e_i = y_i - \hat{y}_i$ is the ordinary residual and $h_{ii} = x_i (X'X)^{-1} x_i'$, with $x_i$ the $i$th row of $X$.
Note that least squares does not require $n$ separate fits
for PRESS. Other regression estimators (e.g. robust) do require all $n$ fits for the
leave-one-out cross-validation estimate of prediction error. Shao (1993) proves with
asymptotic results and simulations that the model with the minimum PRESS statistic or
leave-one-out cross-validation estimate of prediction error is often overspecified. He
recommends using K-fold cross-validation that leaves a subset of observations out.
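
The claim that least squares does not need $n$ separate fits for PRESS can be checked numerically. The sketch below compares the leverage-based shortcut with the brute-force leave-one-out refits; the synthetic data and variable names are assumptions for illustration only.

```python
import numpy as np

# Synthetic data (assumed): intercept plus two regressors.
rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 4.0, 8.0]) + rng.normal(size=n)

# Shortcut: one fit, then rescale each residual by 1 - h_ii.
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
e = y - H @ y                           # ordinary residuals
h = np.diag(H)                          # leverages h_ii
press_shortcut = float(np.sum((e / (1.0 - h)) ** 2))

# Brute force: refit n times, each time omitting one observation.
press_refit = 0.0
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    press_refit += float((y[i] - X[i] @ beta_i) ** 2)

# Leave-one-out cross-validation estimate of average prediction error.
delta_cv1 = press_shortcut / n
```

The two PRESS values agree to floating-point precision. For a robust estimator no such identity is available, so the loop over `i` is the general recipe.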

Quenouille (1949) explored the idea of leaving two observations out of the
training set and Stone (1974) extended the method to more than two. In K-fold
cross-validation, each training set omits approximately $n/K$ observations
rather than a single observation as in PRESS. To predict the response values for the $k$th
assessment set, $S_{k,a}$, all observations apart from those in set $k$ form the training set, $S_{k,t}$,
and these are used to estimate the model parameters. The K-fold cross-validation average
prediction error is $\hat{\Delta}_{CV,K} = n^{-1} \sum_{i=1}^{n} (y_i - \hat{y}_{(k),i})^2$, where $\hat{y}_{(k),i}$ is the predicted response for
observation $i$ belonging in assessment set $S_{k,a}$.

One approach to the K-fold cross-validation estimate of prediction error is to
randomly select the $n/K$ observations to form the assessment set. This process is repeated
numerous times and the prediction errors are averaged. Breiman et al. (1984) propose a
less computationally intense scheme that randomly partitions the data into K
disjoint sets. Davison and Hinkley (1997) recommend $K = \min(n^{1/2}, 10)$ in practice.
This procedure decreases the variance of prediction error over that of the leave-one-out
cross-validation estimate but at the expense of increased bias. Surprisingly, Shao (1993)
demonstrates that the smaller the training set, the better the K-fold estimate is for model
selection.
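
The random-partition scheme can be sketched as follows. This is a minimal illustration under stated assumptions: the data are synthetic, the function name `kfold_cv_error` is hypothetical, and rounding $n^{1/2}$ down to an integer is our reading of the recommended choice of K.

```python
import numpy as np

def kfold_cv_error(X, y, K, rng):
    """K-fold cross-validation estimate of average prediction error (OLS fits)."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)  # K disjoint assessment sets
    sq_err = 0.0
    for assess in folds:
        train = np.setdiff1d(np.arange(n), assess)
        beta_k = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
        sq_err += float(np.sum((y[assess] - X[assess] @ beta_k) ** 2))
    return sq_err / n

# Synthetic data (assumed): intercept plus two regressors.
rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 4.0, 8.0]) + rng.normal(size=n)

K = int(min(np.sqrt(n), 10))                   # 6 when n = 40
delta_cv_k = kfold_cv_error(X, y, K, rng)
delta_cv_loo = kfold_cv_error(X, y, n, rng)    # K = n recovers leave-one-out
```

Setting `K = n` puts one observation in each assessment set, so the function then returns PRESS divided by $n$.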

To reduce the bias, Burman (1990) recommends the adjusted K-fold cross-validation
estimate of prediction error,

$$\hat{\Delta}_{ACV,K} = \hat{\Delta}_{CV,K} + \hat{\Delta}_{app} - \sum_{k=1}^{K} p_k \left( n^{-1} \sum_{i=1}^{n} (y_i - \hat{y}_{(k),i})^2 \right),$$

where $p_k$ is the ratio of observations
in assessment set $k$ to the total $n$ and $\hat{y}_{(k),i}$ is the predicted response for the $i$th observation
from the fit with training set $S_{k,t}$. The Breiman and Spector (1992) simulations
demonstrate that the performance of the adjusted cross-validation prediction error
estimate is slightly worse than the standard biased K-fold cross-validation prediction
error for least squares variable selection. Shao (1993) shows that both the leave-one-out
and K-fold cross-validation procedures have a negligible probability of selecting an
underspecified model. The challenge is avoiding an overfit model.
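
A minimal sketch of the adjusted K-fold estimate is given below. It assumes the standard form of the Burman adjustment (the K-fold estimate, plus the apparent error, minus the fold-weighted error of each training-set fit evaluated on all $n$ observations); the synthetic data and the function name `adjusted_kfold` are illustrative assumptions.

```python
import numpy as np

def adjusted_kfold(X, y, K, rng):
    """Return (K-fold CV error, adjusted K-fold CV error) for OLS fits."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
    delta_app = float(np.mean((y - X @ beta_full) ** 2))  # apparent error
    cv_sq_err = 0.0
    weighted_full_err = 0.0
    for assess in folds:
        train = np.setdiff1d(np.arange(n), assess)
        beta_k = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
        # Usual K-fold term: errors on the held-out assessment set.
        cv_sq_err += float(np.sum((y[assess] - X[assess] @ beta_k) ** 2))
        # Adjustment term: the same fit evaluated on all n observations.
        p_k = len(assess) / n
        weighted_full_err += p_k * float(np.mean((y - X @ beta_k) ** 2))
    delta_cv = cv_sq_err / n
    delta_acv = delta_cv + delta_app - weighted_full_err
    return delta_cv, delta_acv

# Synthetic data (assumed): intercept plus two regressors.
rng = np.random.default_rng(3)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 4.0, 8.0]) + rng.normal(size=n)

delta_cv, delta_acv = adjusted_kfold(X, y, 6, rng)
```

For least squares the adjustment can only lower the estimate: the full-sample fit minimizes the residual sum of squares over all $n$ observations, so every training-set fit has error on the full data at least as large as the apparent error.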

5.2.2  Bootstrap  Procedures 

Bootstrap  estimators  in  regression  have  received  considerable  attention  in  the 
literature  since  their  introduction  by  Efron  (1979).  Wu  (1986)  provides  the  theoretical 
results  for  bootstrap  methods  applied  to  regression.  Hall  (1989)  proves  that  inference 
procedures  in  regression,  such  as  confidence  intervals,  based  on  the  bootstrap  estimate 
are  more  accurate  than  standard  inference  procedures  even  if  the  error  is  Gaussian. 

The fundamental element of a bootstrap procedure is the bootstrap sample. For
bootstrapping pairs in regression (Efron, 1982), the sample is formed by randomly
sampling with replacement $n$ times both a response value and its associated vector of
regressor variable values from the original sample. The bootstrap sample may contain an
observation from the original sample once, multiple times or not at all. In fact, the
probability that an observation is included in a bootstrap sample of size $n$ is approximately
$1 - e^{-1} = 0.632$ (Efron and Tibshirani, 1997). A regression model is then fit to the bootstrap
sample to obtain the bootstrap parameter estimates $\hat{\beta}^*$. A large number of bootstrap
samples ($B > 100$; Davison and Hinkley, 1997) are constructed from the original sample
for model inference.
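
The 0.632 figure follows from a short calculation: a given observation is missed by one draw with probability $1 - 1/n$, so it is missed by all $n$ draws with probability $(1 - 1/n)^n$, which tends to $e^{-1}$. The sketch below evaluates this for $n = 40$; the function name is an assumption for illustration.

```python
import math

def inclusion_probability(n):
    """Probability that a given observation appears at least once
    in a bootstrap sample of size n drawn with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n

p_40 = inclusion_probability(40)   # finite-sample value for n = 40
p_limit = 1.0 - math.exp(-1.0)     # limiting value as n grows
```

For $n = 40$ the exact probability is about 0.637, slightly above the limiting value of 0.632.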

For the variable selection problem, the estimate of the average prediction error for
the $b$th bootstrap sample is $\hat{\Delta}^*_b = n^{-1} \sum_{i=1}^{n} (y_i - x_i \hat{\beta}^*_b)^2$, where $(y_i, x_i)$ are the observations in the original
sample. Efron (1983) provides the unbiased estimator of prediction error for the $b$th
sample as

$$\hat{\Delta}_{b,unbiased} = n^{-1} \sum_{i=1}^{n} (y_i - x_i \hat{\beta})^2 + n^{-1} \sum_{i=1}^{n} (y_i - x_i \hat{\beta}^*_b)^2 - n^{-1} \sum_{i=1}^{n} (y^*_i - x^*_i \hat{\beta}^*_b)^2,$$

where $x^*_i$ is the vector of regressor values for the $i$th observation in the bootstrap sample. The
overall unbiased bootstrap estimate of average prediction error is simply
$\hat{\Delta}_{BS} = B^{-1} \sum_{b=1}^{B} \hat{\Delta}_{b,unbiased}$, where $B$ is the number of bootstrap samples. Shao (1996) shows
that selecting the model with the minimum $\hat{\Delta}_{BS}$ is inconsistent. Inconsistency implies
that the probability the true model has the minimum bootstrap average prediction error
does not converge to 1 as $n$ approaches infinity. Shao corrects this inconsistency for
bootstrapping pairs by using substantially fewer than $n$ observations to construct the
bootstrap samples. This procedure uses the biased estimate of prediction error. Breiman
(1996), motivated by the 0.632 probability that an observation is selected in a bootstrap
sample, notes that using bootstrap samples of size $2n$ has little effect on OLS variable
selection.
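
The bias-corrected bootstrap estimate for bootstrapping pairs can be sketched as follows. The code assumes the usual optimism correction (apparent error plus the average difference between each bootstrap fit's error on the original sample and its error on its own bootstrap sample); the synthetic data and the function name are illustrative assumptions.

```python
import numpy as np

def bootstrap_prediction_error(X, y, B, rng):
    """Bias-corrected bootstrap estimate of average prediction error (pairs)."""
    n = len(y)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    delta_app = float(np.mean((y - X @ beta_hat) ** 2))  # apparent error
    err_orig = np.empty(B)  # bootstrap fit evaluated on the original sample
    err_boot = np.empty(B)  # bootstrap fit evaluated on its own sample
    for b in range(B):
        idx = rng.integers(0, n, size=n)  # resample (y_i, x_i) pairs
        beta_b = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        err_orig[b] = np.mean((y - X @ beta_b) ** 2)
        err_boot[b] = np.mean((y[idx] - X[idx] @ beta_b) ** 2)
    delta_bs = delta_app + float(np.mean(err_orig - err_boot))
    return delta_app, err_orig, delta_bs

# Synthetic data (assumed): intercept plus two regressors.
rng = np.random.default_rng(4)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 4.0, 8.0]) + rng.normal(size=n)

delta_app, err_orig, delta_bs = bootstrap_prediction_error(X, y, 100, rng)
```

Drawing fewer than $n$ indices in `rng.integers` would give the reduced-sample variant attributed to Shao; in that case the biased term `err_orig` alone is averaged.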

5.2.3 Other Modifications to Resampling Methods for Variable Selection

Breiman and Spector (1992) explore the use of cost admissibility (penalty for
adding variables) with bootstrap and cross-validation prediction error for variable
selection. Their empirical results indicate that this modification only slightly increases
the probability of selecting the correct model. This is an important result because most
resampling estimates of prediction error do not account for the number of variables in the
model.

Breiman (1992) recommends the little bootstrap estimate of prediction error for
variable selection in linear models. The prediction error for a $k$ variable model using this
approach is $\hat{\Delta}_{app}(k) + 2B_t(k)$. The little bootstrap error, $B_t(k)$, is the resubstitution error
from the model selected using $y^* = y + t\varepsilon$, where $\varepsilon$ is the vector of variates from
NID$(0, \sigma^2)$ with $0.6 < t < 0.8$. The $MS_E$ for the full model is used as an estimate of $\sigma^2$.
Breiman shows that the little bootstrap is unbiased and superior to $C_p$, F-to-enter, and
F-to-leave for variable selection for fixed designs.

Breiman (1996) suggests bagging (bootstrap aggregating) regressor variables.
For each of the $B$ samples formed by bootstrapping pairs, perform a forward selection to
obtain a 1 variable model, 2 variable model, ..., $k$ variable model. The $n \times k$ matrices of
predicted values from these $k$ models are averaged across the $B$ bootstrap samples. The
model with the lowest average prediction error is selected. Limited simulation results
indicate that this procedure performs better than standard forward selection. It is unclear
how to proceed if the same variables are not consistently selected in the $B$ samples for a
given dimension.

Davison and Hinkley (1997) describe a hybrid estimate of bootstrap prediction
error for variable selection adapted from Efron and Tibshirani (1997). The hybrid
estimate of prediction error weights the apparent error and the bootstrap cross-validation
(BCV) error. The BCV error is calculated from the predicted and observed values of
those observations not included in the bootstrap sample. The recommended weights from
theory and practice are 0.632 for the BCV error and (1 - 0.632) for the apparent error. The
authors' empirical evidence suggests that this procedure is superior.
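
The hybrid estimate can be sketched as below. The per-observation averaging used for the BCV error (each observation's squared error is averaged over the bootstrap samples that exclude it) is one common formulation and is an assumption here, as are the synthetic data and the function name.

```python
import numpy as np

def hybrid_632_error(X, y, B, rng):
    """Return (apparent error, BCV error, 0.632 hybrid estimate) for OLS fits."""
    n = len(y)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    delta_app = float(np.mean((y - X @ beta_hat) ** 2))
    sq_err_sum = np.zeros(n)  # summed squared errors while held out
    out_count = np.zeros(n)   # number of samples that exclude each observation
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        out = np.ones(n, dtype=bool)
        out[idx] = False      # observations not drawn into this sample
        beta_b = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        sq_err_sum[out] += (y[out] - X[out] @ beta_b) ** 2
        out_count[out] += 1
    seen = out_count > 0      # guard: skip observations never held out
    delta_bcv = float(np.mean(sq_err_sum[seen] / out_count[seen]))
    delta_632 = (1 - 0.632) * delta_app + 0.632 * delta_bcv
    return delta_app, delta_bcv, delta_632

# Synthetic data (assumed): intercept plus two regressors.
rng = np.random.default_rng(5)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 4.0, 8.0]) + rng.normal(size=n)

delta_app, delta_bcv, delta_632 = hybrid_632_error(X, y, 100, rng)
```

Because the hybrid is a convex combination, it always lies between the optimistic apparent error and the BCV error.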

5.3  An  Alternative  Criterion  for  Variable  Selection 

Recent  results  indicate  that  many  of  the  classical  measures  used  for  variable 
selection such as $R^2$, adjusted $R^2$, $C_p$, and PRESS, are highly biased and not suitable for
variable selection. The computer selection methods of forward, backward, and stepwise
are based on these measures and also provide biased results. Many of the arguments
against these procedures are derived from asymptotic properties and assume that the
candidate model with the minimum (or maximum for $R^2$ measures) value of the statistic
is  selected.  We  believe  satisfactory  results,  and  in  many  cases  superior  results,  are 
possible  by  relaxing  the  requirement  of  selection  by  the  minimum  value  of  the  statistic. 
Rather,  one  would  select  the  model  that  has  a  low  prediction  error  with  the  fewest 
variables.  This  procedure  is  a  more  realistic  representation  of  what  a  practitioner  is  likely 
to  do  given  the  prediction  errors  from  the  candidate  models.  Obviously,  there  is  more 
subjectivity  with  this  criterion  than  simply  selecting  the  model  with  the  minimum 
prediction error.

To illustrate the methodology, the average prediction errors ($\hat{\Delta}_{BS}$) from 100
bootstrap samples each for models with an increasing number of active variables are
displayed in Table 5.1. The model is $y = X\beta + \varepsilon$, where $X$ is the design matrix formed
by augmenting the four regressor variables from the Gunst and Mason (1980) data with a
column of ones, $\beta$ is the known vector of parameters and $\varepsilon$ is the vector of NID$(0, \sigma^2)$
error terms. This data set (shown in the S-Plus code in Appendix C) is used extensively
in the Shao (1993 and 1996) studies and in Sections 5.4 through 5.6. The column
headings of Table 5.1 display the known generating vector $\beta$ used to calculate the
response values. Our procedure looks at the change in prediction error going from a
model of dimension $j$ to dimension $j + 1$. If there is only a slight decrease, then the
smaller dimension model is preferred. In the second column of Table 5.1, our strategy
would correctly choose the 2-parameter model (intercept and one regressor) rather than the model
with the minimum prediction error, the 5-parameter model. Similarly, the proposed
method would select the correct model (the shaded cells) for the other columns.


Table 5.1. Average prediction error from 100 bootstrap samples as a function of the
number of variables in the model. The column headings are the generating vectors $\beta$ of
the true model. The correct model in each column (the shaded cell) is marked with *.

 p    [2,0,0,4,0]   [2,0,0,4,8]   [2,9,0,4,8]   [2,9,6,4,8]
 1       18.51        130.03        188.90        266.01
 2        0.96*        15.75         22.07         22.27
 3        1.01          0.89*         3.36          9.35
 4        1.05          0.95          0.95*         4.02
 5        0.95          0.96          0.93          0.96*

In practice, the proposed criterion requires subjective judgment. For simulation
studies,  we  must  specify  the  minimum  change  in  prediction  error  required  to  select  the 
next  higher  dimension  model.  We  follow  the  impurity  logic  used  to  split  and  terminate 
nodes  in  Classification  and  Regression  Trees  (Breiman  et  al.,  1984).  If  the  change  in 
prediction  error  does  not  exceed  a  certain  percentage  of  the  prediction  error  for  the  null 
model  (intercept  only),  then  the  lower  dimension  model  is  selected.  For  example,  if  our 
minimum  change  in  prediction  error  criterion  were  1%,  then  the  difference  in  prediction 
error must be at least 0.185 between models of sizes $j$ and $j + 1$ for the first column in
Table  5.1.  We  are  not  advocating  using  a  specific  percentage  as  much  as  carefully 
inspecting  the  prediction  errors  between  candidate  models.  The  percentages  are  useful 
for  comparative  studies  in  simulations. 
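
The threshold rule can be applied directly to the prediction errors in Table 5.1. The sketch below is one possible reading of the rule: step through models of increasing dimension and stop at the first model for which adding a parameter fails to reduce the prediction error by the stated fraction of the null-model error. The function names and the stepwise stopping interpretation are assumptions for illustration.

```python
def select_by_change(errors, fraction):
    """errors[j - 1] is the prediction error of the j-parameter model;
    errors[0] belongs to the null (intercept-only) model."""
    threshold = fraction * errors[0]
    for j in range(1, len(errors)):
        # Decrease obtained by moving from j to j + 1 parameters.
        if errors[j - 1] - errors[j] < threshold:
            return j
    return len(errors)

def select_by_minimum(errors):
    """Number of parameters in the model with the minimum prediction error."""
    return errors.index(min(errors)) + 1

# Columns of Table 5.1, keyed by the number of active parameters in the true model.
table_5_1 = {
    2: [18.51, 0.96, 1.01, 1.05, 0.95],
    3: [130.03, 15.75, 0.89, 0.95, 0.96],
    4: [188.90, 22.07, 3.36, 0.95, 0.93],
    5: [266.01, 22.27, 9.35, 4.02, 0.96],
}

by_change = {k: select_by_change(v, 0.01) for k, v in table_5_1.items()}
by_minimum = {k: select_by_minimum(v) for k, v in table_5_1.items()}
```

With a 1% constant the change criterion recovers the true dimension in all four columns (2, 3, 4 and 5 parameters), whereas the minimum criterion picks the 5-parameter model in the first and third columns.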

5.4  A  Simulation  Study 

The simulation scenarios reported in Shao (1996) provide an ideal test bed for the
proposed change in prediction error criterion. The regressor variables are those from the
Gunst and Mason (1980) data set with $n = 40$ cases and responses generated as described
in Section 5.3. Some of the constants in $\beta$ are 0; therefore, the objective of the study is to
assess several resampling methods' ability to correctly identify the active set of regressor
variables using both the minimum prediction error and the proposed change in prediction
error criteria. Shao's objective is to demonstrate that using a much smaller bootstrap
sample size than $n$ leads to consistent variable selection for the minimum prediction error
criterion while all other techniques (PRESS, $C_p$, and the bootstrap with full sample) have
considerably less capability to select the proper model. A useful outcome of our study
would be to demonstrate that these inconsistent, yet commonly used, techniques can
provide reliable model selection if we change the criterion.

5.4.1  Simulation  Details 

The 1000 simulation replicates generate the data sets exactly as in Shao (1996).
That is, each data set consists of the Gunst and Mason set of regressor variables and response values
calculated by specifying the known parameters and adding a vector of standard normal
variates to $X\beta$. The measure of effectiveness for a procedure is the proportion of the
1000 replicates in which the correct model of known dimension is selected. The resampling
estimates of prediction error used are the leave-one-out cross-validation estimate, the
K-fold cross-validation, the adjusted K-fold cross-validation, the bias-corrected bootstrap
using sample size $n$, and the bootstrap with sample size $n/2$. Following the Davison and
Hinkley (1997) recommendations, the value of K is 6 and the number of bootstrap
samples, $B$, is fixed at 100 per replication. The prediction errors for these 100 bootstrap
samples are averaged and then the model selection criterion (minimum prediction error or
change in prediction error) is applied. For the change in prediction error, we run pilot
studies to find reasonable values for the constant defined as the percentage of null model
prediction error. We follow Shao's suggestion that it is not practical to evaluate all
possible models and instead evaluate one model in each dimension.

5.4.2 Simulation Results

All  simulation  results  are  summarized  in  tables.  The  known  correct  model  is 
shaded  in  the  tables.  The  proportion  of  simulation  replicates  that  the  minimum  prediction 
error  criterion  selects  the  model  is  the  first  entry  in  each  cell.  The  proportion  from  the 
proposed  criterion  is  the  second  entry  in  each  cell. 

Table 5.2 displays the results for the model with the intercept and one regressor active. All five
procedures  have  above  a  98%  chance  of  correctly  identifying  the  true  model  with  the 
proposed  selection  criterion.  This  suggests  that  a  practitioner  comparing  models  with  the 
PRESS  statistic  would  likely  have  made  the  correct  choice  upon  careful  examination  of 
the  prediction  errors.  Consistent  with  the  results  reported  by  Shao  (1996),  the  minimum 
prediction  error  criterion  has  difficulty  with  overfitting  for  most  methods.  No  prediction 
error  method  except  the  bootstrap  half  sample  reliably  selects  the  correct  model  under  the 
minimum prediction error criterion. A surprising result is that only 50.9% of the 100,000
bootstrap samples (100 bootstrap samples x 1000 replicates) selected the correct model
with the minimum prediction error criterion. Yet, when prediction error is averaged over
the 100 bootstrap samples, the correct model is selected in approximately 95% of the
1000 replicates.

Table 5.2. Results for 1000 simulation replicates for model selection with 2
active parameters in the Shao (1996) data sets. The selection percentages for the true
model [2, 0, 0, 4, 0] are shaded. The top number in each cell is the proportion of
replications that the model was selected using the minimization of prediction error
criterion and the bottom uses the proposed change in prediction error criterion with
constant .025. The values in brackets are the proportion of 100,000 bootstrap
samples that the model is selected. Results are accurate to approximately ±0.03.

The results in Tables 5.3 and 5.4 for the 3 and 4 active parameter models,
respectively,  further  confirm  the  superiority  of  the  change  in  prediction  error  criterion. 
There  is  virtually  a  100%  chance  of  selecting  the  correct  model  with  the  proposed 
criterion  independent  of  the  resampling  procedure  choice.  Contrary  to  most  other 
published results, the leave-one-out cross-validation estimate slightly outperforms the
K-fold and adjusted K-fold methods for the minimum prediction error criterion. Also, the
adjusted K-fold method is slightly worse than the K-fold, which agrees with Breiman and
Spector (1992). The results in Table 5.5 with all 5 parameters active favor the minimum
prediction error criterion. This is not unexpected because this criterion rarely selects an
underspecified model.

Table 5.3. Results for 1000 simulation replicates for model selection with 3 active
parameters in the Shao (1996) data sets. The selection percentages for the true model
[2, 0, 0, 4, 8] are shaded. The top number in each cell is the proportion of
replications that the model was selected using the minimization of prediction error
criterion and the bottom uses the change in prediction error criterion with constant .025.
The values in brackets are the proportion of 100,000 bootstrap samples that the
model is selected. Results are accurate to approximately ±0.03.

Table 5.4. Results for 1000 simulation replicates for model selection with 4 active
parameters in the Shao (1996) data sets. The selection percentages for the true model
[2, 9, 0, 4, 8] are shaded. The top number in each cell is the proportion of replications
that the model was selected using the minimization of prediction error criterion and
the bottom uses the change in prediction error criterion with constant .025. The
values in brackets are the proportion of 100,000 bootstrap samples that the model is
selected.

Table 5.5. Results for 1000 simulation replicates for model selection with
5 active parameters in the Shao (1996) data sets. The selection percentages for the
true model [2, 9, 6, 4, 8] are shaded. The top number in each cell is the proportion of
replications that the model was selected using the minimization of prediction error
criterion and the bottom uses the change in prediction error criterion with constant
.025. The values in brackets are the proportion of 100,000 bootstrap samples that the
model is selected. Results are accurate to approximately ±0.03.

5.5 Extensions to Noisy and High-Dimension Data Sets

The  minimum  change  in  prediction  error  is  the  superior  criterion  for  any 
resampling  method  in  the  data  sets  used  in  Shao  (1996).  The  success  of  the  procedure 
may  be  attributed  to  the  low  dimension  of  the  data  (4  regressor  variables),  the  high 
signal-to-noise ratio or possibly a combination of both. We conduct some substudies in
this section to further characterize the performance of both variable selection criteria.


5.5.1  High-Dimension  Data 

We modify the Gunst and Mason data set of regressor variables to include 5
additional variables whose levels are generated from a NID$(0, 0.7^2)$ distribution. This
approximately matches the current levels for the majority of the original 4 regressors.
The five additional variables have no effect on the generated response values as the
$\beta'$ vector is now [2, 9, 6, 4, 8, 0, 0, 0, 0, 0]. The response variable is still generated as
$y = X\beta + \varepsilon$, where $\varepsilon$ is NID$(0, \sigma^2 I)$ with $\sigma^2 = 1$.




The  probabilities  in  Table  5.6  again  indicate  that  the  change  in  prediction  error 
criterion  performs  better  than  the  minimum  prediction  error  criterion.  The  minimum 
prediction  error  criterion  overfits  models  except  with  the  bootstrap  resampling  method 
using  half  samples.  Note  that  in  only  42.8%  of  the  100,000  bootstrap  samples  was  the 
correct  model  selected  under  this  criterion  for  the  bootstrap  half  sample  method.  In 
contrast,  the  proposed  criterion  selects  the  correct  model  in  over  80%  of  the  bootstrap 
samples  using  the  full  sample.  The  change  in  prediction  criterion  produces  the  following 
important findings: 1) the bootstrap using the full sample is best, 2) the leave-one-out
cross-validation outperforms the K-fold cross-validation procedures, 3) the bias adjusted
K-fold is slightly preferred to the ordinary K-fold estimate of prediction error, and 4) any
resampling method has a high probability of selecting the correct model.

Table 5.6. Results for 1000 simulations for model selection from Shao (1996). The first
four regressors are the Gunst and Mason data and the last five regressors are variates
from NID$(0, 0.7^2)$ and $\varepsilon$ is NID$(0, \sigma^2)$. The true model, $\beta' = [2, 9, 6, 4, 8, 0, 0, 0, 0, 0]$,
is shaded. The top number in each cell is the proportion of replicates the model was
selected using the minimization of prediction error criterion and the bottom uses the
change in prediction error criterion with constant .001. The values in brackets are the
proportion of 100,000 bootstrap samples that the model is selected. Results are accurate
to approximately ±0.03.

5.5.2 High-Dimension and Noisy Data

All of the previous correctly specified models (the shaded models in Tables 5.2 through
5.6) are highly significant. That is, the signal-to-noise ratio is high, as evidenced by
R² values ranging from 0.98 to 0.995. We would typically not expect to see such high
R² values in practice. Also, there is a peculiarity in the Gunst and Mason data: Section
5.6 details that 12 of the 40 observations are extreme in X-space (high-leverage).
For these reasons, we temporarily abandon the Gunst and Mason data. This
sub-study evaluates the performance of the resampling algorithms and the two model
selection criteria when the signal-to-noise ratio and remoteness in X-space are not as
extreme.

The artificial data set generates standard normal variates for the design matrix, X,
of dimension n = 40 observations and k = 9 regressor variables, augmented with a column
of ones. For this study, the design matrix changes for each of the 1000 replications to
represent the X-random, as opposed to X-fixed, case for regression. Breiman and Spector
(1992) state that significant differences exist between the two assumptions with respect to
variable selection and that X-random designs are appropriate for most analyses. The
response variable is generated as usual, y = Xβ + ε, except that ε is NID(0, σ²I) with σ
= 10. The known vector of parameters is the same as in the previous experiment, β' = [2, 9,
6, 4, 8, 0, 0, 0, 0, 0]. The R² values with the new distribution of the error term and design
matrix range between 0.65 and 0.75. This amount of noise in the data is more realistic
for many applications.
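The data-generating scheme just described can be sketched in a few lines. The following is a minimal illustration, not the simulation code used in this study; the function names, the seed, and the choice of 200 replicates are assumptions made only for the example. It draws X-random replicates and reports the average R² of the least squares fit of the correctly specified five-parameter model.

```python
import numpy as np

# Known parameter vector from the text: intercept plus nine regressors.
BETA = np.array([2, 9, 6, 4, 8, 0, 0, 0, 0, 0], dtype=float)

def simulate_x_random(n=40, k=9, sigma=10.0, rng=None):
    """Draw one X-random replicate: standard normal regressors plus a column of ones."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])
    y = X @ BETA + rng.normal(0.0, sigma, size=n)
    return X, y

def r_squared(X, y):
    """R^2 of the least squares fit of y on X (X already holds the intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
r2 = []
for _ in range(200):            # 200 replicates is an illustrative choice
    X, y = simulate_x_random(rng=rng)
    r2.append(r_squared(X[:, :5], y))  # correctly specified 5-parameter model
print(f"mean R^2 over 200 replicates: {np.mean(r2):.2f}")
```

A new design matrix is drawn inside the loop on every replicate, which is what distinguishes the X-random case from the X-fixed case.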

The minimum prediction error criterion (the top value in each cell of Table
5.7) gives results similar to the change in prediction error criterion with constant equal to
0.001 (the middle value in each cell). Neither criterion reliably identifies the correct
model using cross-validation or the bootstrap with the full sample size. The only
procedure that does not consistently lead to overfit models is the bootstrap using a sample
size of n = 20.

One possible solution to make these resampling procedures more reliable is to
increase the constant from 0.001. This constant value means that the model of the
next higher dimension will be selected if the change in prediction error exceeds 1/10 of
1% of the prediction error from the null model. The middle value of each cell in Table
5.7 indicates that we rarely select models that are underfit. Accepting more risk of
underfitting by increasing the value of the constant can lead to a higher probability of
correct model selection. The last value in each cell of Table 5.7 is the proportion of
replications in which the change in prediction error criterion selects the model if the
constant is changed to 0.03. The constant, calculated from pilot studies, is set to achieve a
balance between underfit and overfit models. The best prescription still appears to be the
bootstrap with the half sample size for either criterion. However, cross-validation and the
bootstrap with the full sample are competitive if the change in prediction error criterion is
used.

Note that the proportions in Tables 5.6 and 5.7 for the proposed change in
prediction error criterion are conservative for the correct (shaded) model. The
programming logic does not adequately address situations in which there are significant
drops in prediction error in higher dimension models but the prediction error is still
greater than that of the correctly specified model. To illustrate, consider the following
vector of average prediction errors: A = [1000, 500, 100, 50, 15, 14, 45, 60, 25, 30]. The
correct model is the 5-parameter model with prediction error 15. The change in prediction
error criterion with constant 0.03, as programmed, selects the 9-parameter model because
the prediction error decreases by more than 30 (0.03 × 1000) in moving from the 8-parameter
model (60) to the 9-parameter model (25). It is difficult to capture the
subjective nature of the process; however, the logic errs on the conservative side.
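The behavior in this example can be reproduced with a short sketch. The code below is one reading of the programmed logic that is consistent with the example (select the largest model whose drop in prediction error from the next smaller model exceeds the constant times the null-model prediction error); the function names are hypothetical, and the third function is a hypothetical stop-at-first-failure variant included only for contrast, not a rule taken from this study.

```python
def select_min_pe(pe):
    """Number of parameters of the model with the smallest prediction error."""
    return min(range(len(pe)), key=lambda i: pe[i]) + 1

def select_change_pe(pe, constant):
    """Largest model whose drop in prediction error from the next smaller
    model exceeds constant * PE(null model); pe[0] is the null model."""
    threshold = constant * pe[0]
    selected = 1
    for i in range(1, len(pe)):
        if pe[i - 1] - pe[i] > threshold:
            selected = i + 1
    return selected

def select_change_pe_stop_early(pe, constant):
    """Hypothetical variant: stop at the first step whose drop is too small."""
    threshold = constant * pe[0]
    selected = 1
    for i in range(1, len(pe)):
        if pe[i - 1] - pe[i] <= threshold:
            break
        selected = i + 1
    return selected

A = [1000, 500, 100, 50, 15, 14, 45, 60, 25, 30]
print(select_min_pe(A))                      # 6: the minimum (14) is the 6-parameter model
print(select_change_pe(A, 0.03))             # 9: the drop from 60 to 25 exceeds 30
print(select_change_pe_stop_early(A, 0.03))  # 5: the drop from 15 to 14 does not exceed 30
```

Under this reading, a replicate such as A is counted against the correct 5-parameter model, which is why the reported proportions understate how often an analyst inspecting the full vector would choose it.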


Table 5.7. Results for 1000 simulation replicates for model selection. All regressor
variable values are generated from a standard normal distribution. The response is
generated from the vector β' = [2, 9, 6, 4, 8, 0, 0, 0, 0, 0] with ε NID(0, σ²I) and σ
= 10. The top value in each cell is the proportion of the 1000 replicates the model
was selected using the minimization of prediction error criterion, the middle value is
the change in prediction error criterion with constant 0.001, and the bottom value is
the change in prediction error criterion with constant 0.03. The values in brackets

5.6 Variable Selection in the Presence of Outliers

The  complexity  of  variable  selection  significantly  increases  for  regression  models 
contaminated  with  outliers.  Markatou  et  al.  (1991)  state  that  tests  on  least  squares 
regression  parameters  lose  power  dramatically  in  the  presence  of  outliers  and  leverage 
points.  One  approach  to  overcome  the  loss  of  power  is  to  use  a  robust  regression 
estimator.  The  previous  chapters  illustrate  that  least  squares  is  not  the  estimator  of  choice 
in  contaminated  samples  and  that  compound  estimators  demonstrate  the  best  overall 
performance.  This  section  reviews  the  robust  regression  variable  selection  literature  for 
both  analytical  and  resampling  methods  and  conducts  comparative  evaluations  of 
resampling methods with compound estimators. There are few empirical results in the
literature  that  address  the  combined  problem  of  compound  estimation  and  resampling 
because  both  procedures  are  computationally  complex. 

5.6.1  Variable  Selection  with  Robust  Regression  Estimators 

Although numerous robust estimators have been proposed in the last 25 years,
there are significantly fewer results in the literature that explore variable selection
procedures in the robust regression model. Most robust regression variable selection
methods are based on robust versions of the general linear test that use the asymptotic
covariance matrix (Hampel et al., 1986). Markatou and He (1994) and Heritier and
Ronchetti (1994) extend the Wald (similar to t-tests) and drop-in-dispersion tests (similar
to F-tests) to GM and compound estimators. Field (1997) and Field and Welsh (1998)
propose saddlepoint approximations of tail area probabilities for robust regression
hypothesis testing as improvements to the asymptotic approach. The results are mixed
and they recommend further testing in finite samples. Ronchetti and Staudte (1994)
propose a robust version of Mallows's Cp. The method multiplies the squared residuals
by the final weights from a robust fit to compute the residual sum of squares. Two
additional quantities, which are functions of the number of parameters and the selected
robust estimator, are then added to the residual sum of squares. The robust Cp appears to
work satisfactorily for their three examples, but no simulation results are reported.

The Wald test is currently preferred (Heritier, 1997) because of its asymptotic chi-
square distribution and the relative ease of calculating the asymptotic covariance matrix.
Wilcox (1997) experimented (results not reported) with the Wald test using the M-
estimator and the Coakley and Hettmansperger (1993) compound estimator. For both
estimators, he found poor control over the Type I error, even with normal error terms and
n = 100. All authors conclude that it is important to do further testing and evaluation to
understand the strengths and weaknesses of the methods in finite samples.

Bootstrap methods can be used in robust regression to construct confidence
intervals and prediction intervals (Efron and Tibshirani, 1993; Davison and Hinkley,
1997; Wilcox, 1994, 1996a, 1996b, 1997). Mammen (1993) shows the consistency of the
bootstrap for linear tests with the M-estimator.

Wilcox (1997, 1998) presents an interesting approach to the variable selection
problem in robust regression by using a bootstrap resampling scheme. He uses a
percentile bootstrap approach to find critical values for the joint confidence region on the
Mahalanobis distance for the model parameters. The steps of the algorithm are:

1. Obtain B bootstrap estimates of β by bootstrapping pairs.

2. Estimate the covariance matrix V using all B bootstrap estimates of β.

3. Find the Mahalanobis distance of (β̂* − β̂) using V⁻¹ for each bootstrap
sample, where β̂* is the bootstrap estimate of the model parameters and β̂ is the
vector of parameter estimates from the original data.

4. Sort the Mahalanobis distances and call the (1 − α)B-th ordered distance the
critical value.

5. Find the test statistic as the Mahalanobis distance of (β̂ − c) using V⁻¹, where c
is a vector of constants, often selected as 0 to test for significance.
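The five steps can be sketched as follows. This is a minimal illustration under stated assumptions, not Wilcox's implementation: ordinary least squares is used only as a stand-in estimator so that the example is self-contained (a robust or compound estimator would be passed through the hypothetical `fit` argument), and the function name, defaults, and simulated data are all assumptions of the example.

```python
import numpy as np

def ols_fit(X, y):
    """Stand-in estimator; any function returning a coefficient vector could be used."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def bootstrap_mahalanobis_test(X, y, fit=ols_fit, c=None, B=599, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    beta_hat = fit(X, y)                        # estimates from the original data
    c = np.zeros_like(beta_hat) if c is None else np.asarray(c, dtype=float)

    # Step 1: B bootstrap estimates obtained by resampling (x_i, y_i) pairs.
    boot = np.empty((B, beta_hat.size))
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        boot[b] = fit(X[idx], y[idx])

    # Step 2: covariance matrix V of the bootstrap estimates.
    V_inv = np.linalg.inv(np.cov(boot, rowvar=False))

    # Step 3: Mahalanobis distance of each bootstrap estimate from beta_hat.
    diffs = boot - beta_hat
    d_boot = np.sqrt(np.einsum("ij,jk,ik->i", diffs, V_inv, diffs))

    # Step 4: the (1 - alpha)B-th ordered distance is the critical value.
    k = min(int(np.ceil((1 - alpha) * B)), B)
    crit = np.sort(d_boot)[k - 1]

    # Step 5: test statistic is the Mahalanobis distance of beta_hat from c.
    stat = float(np.sqrt((beta_hat - c) @ V_inv @ (beta_hat - c)))
    return stat, float(crit), stat > crit

# Illustrative data with a strong signal.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.standard_normal((40, 2))])
y = X @ np.array([2.0, 4.0, 8.0]) + rng.standard_normal(40)
stat, crit, reject = bootstrap_mahalanobis_test(X, y, B=200)
print(stat > crit)
```

With c = 0 the test asks only whether the full parameter vector differs from zero, so a rejection does not indicate which of several competing models is preferable.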


Wilcox  (1998)  states  that  there  is  room  for  improvement  with  this  method  because 
the  probability  of  a  Type  I  error  can  be  substantially  less  than  nominal  levels  in  many 
circumstances.  He  cautions  that  this  approach  does  not  work  well  with  least  squares; 
correction  factors  through  simulation  are  required  to  achieve  the  correct  coverage 
probabilities. Our experiments with compound estimators indicate that the algorithm is a
dependable  diagnostic  to  test  if  at  least  one  of  the  variables  is  active;  however,  the  test 
statistics  are  not  useful  to  differentiate  between  competing  models. 

Davison and Hinkley (1997) provide a brief discussion of resampling methods in
robust regression. Their guidance on resampling methods for variable selection in robust
regression focuses on two main points: 1) remove gross outliers from the analysis,
because too many outliers could appear in the resampled data, leading to inefficiency and
breakdown, and 2) most of the prediction error methods for least squares should apply to
robust regression. They identify the outliers to remove using residuals from an LTS fit.

Thus, there is relatively little guidance for variable selection using cross-
validation or bootstrap estimates of prediction error in robust regression. The next
section revisits and modifies the Gunst and Mason data to contain residual outliers. We
compare the same resampling methods and criteria as in Section 5.5, except that we use a
compound estimator rather than least squares.


5.6.2  Modified  Gunst  and  Mason  Data 

The Gunst and Mason data used in the previous sections have several
observations that are extreme in X-space. The hat diagonals indicate that only 4 of the 40
observations (2, 8, 15, and 39) are remote in X-space using the usual 3p/n criterion
(Hoaglin and Welsch, 1978). However, the Rocke and Woodruff (1996) robust distances
(see Chapter 4) in Table 5.8 indicate that the 12 shaded observations are remote in X-
space. These high-leverage points could also explain the large R² values seen in Section
5.4. In practice, the response values of these extreme points in X-space may not follow
the regression surface as well as in the previous experiments. We plant four residual
outliers by adding 10.0 to the response values of the high-leverage observations 8, 15, 28,
and 39. The data set now contains 10% residual outliers at a distance of 10σ. The
simulations are run exactly as described in Section 5.4 and β' = [2, 0, 0, 4, 8].


Table 5.8. Rocke and Woodruff (1996) robust distances for the Gunst and Mason (1980)
data. The observations with shaded robust distance cells are considered remote in X-
space because they exceed the cutoff value of 10

The probabilities in Table 5.9 indicate that no resampling technique using any
criterion successfully selects the correct three-parameter model. The two inactive
variables are now significant in model selection because the least squares estimator has
used them to fit the outliers. This example illustrates the important linkage between
outlier identification and variable selection in model building. The planted outliers are
masked; they do not have unusual residual values. Note also from Table 5.9 that the
minimum prediction error criterion most often selects the 5-parameter model, while the
change in prediction error criterion selects the 4-parameter model. The least squares
estimator has failed; outliers are masked and insignificant variables now appear to be
significant.

A logical choice for this data set contaminated with high-leverage points and
residual outliers is a compound estimator. Resampling estimates of prediction error with
compound estimators could potentially pose some problems. For example, the estimator
could break down because, by chance, a bootstrap sample may contain too many of the
planted outliers. Breakdown means that the parameter estimates are no longer valid for
the bulk of the data (see Chapter 2). Also, if the compound estimator successfully
downweights the outliers, the resulting prediction error may not necessarily be low
relative to the other models. This is explained by the existence of two sources of error
contributing to overall prediction error. One source of error is the lack of a good fit due
to model misspecification. The other source is that the estimator works and assigns large
residual values to the outliers, which inflates the overall prediction error.
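The chance that a bootstrap sample is heavily contaminated can be gauged with a simple binomial calculation, since each draw with replacement selects a planted outlier with probability 4/40. The sketch below is only illustrative: the function name is hypothetical, and the 25% contamination threshold is an arbitrary value chosen for the example, not the breakdown point of any estimator discussed here.

```python
from math import comb

def prob_at_least(n_draws, p_outlier, k):
    """P(X >= k) for X ~ Binomial(n_draws, p_outlier)."""
    return sum(
        comb(n_draws, j) * p_outlier**j * (1 - p_outlier) ** (n_draws - j)
        for j in range(k, n_draws + 1)
    )

p = 4 / 40                     # four planted outliers among 40 observations
threshold_fraction = 0.25      # illustrative assumption, not an estimator property
for m in (40, 20):             # full and half bootstrap sample sizes
    k = -(-m * 25 // 100)      # ceil(0.25 * m) using integer arithmetic
    print(m, k, prob_at_least(m, p, k))
```

The same function can be evaluated at whatever contamination fraction is relevant for a given estimator.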


Table 5.9. Results for 1000 simulations for model selection from Shao (1996)
using least squares parameter estimates. Observations 8, 15, 28, and 39 are made
residual outliers. The true model [2, 0, 0, 4, 8] is shaded. The top number in each
cell is the fraction of time the model was selected using the minimization of
prediction error criterion and the bottom uses the change in prediction error criterion
with constant 0.025. The values in brackets for the bootstrap are the proportions of
100,000 bootstrap samples in which the model is selected.

The values in Table 5.10 are the proportion of 100 replicates in which the model was
selected if the Simpson and Montgomery (1998a) compound estimator replaces least
squares. The change in prediction error criterion (constant = 0.025) reliably identifies the
correct model for all resampling methods, with a slight edge given to the bootstrap full
sample. The minimum prediction error criterion is not useful except with the bootstrap,
and even there it performs poorly with the full sample.

Table 5.10. Results for 100 simulations for model selection from Shao (1996) using
the Simpson and Montgomery compound estimator. Observations 8, 15, 28, and 39 are
made residual outliers. The true model [2, 0, 0, 4, 8] is shaded. The top number in each
cell is the fraction of time the model was selected using the minimization of prediction
error criterion and the bottom number uses the change in prediction error criterion with
constant 0.025. The values in brackets for the bootstrap

As an alternative to resampling with a computationally expensive compound
estimator, we consider the Davison and Hinkley (1997) recommendation to first remove
observations with large residuals from a robust fit. Their choice of estimator, LTS, is high-
breakdown but is not a bounded-influence estimator. Therefore, outliers will likely be
removed from the sample only if they are not high-leverage points.

For the modified Gunst and Mason data, we first remove the observations with
standardized residuals exceeding a value of 2.5 from a fit with the high-breakdown, high-
efficiency, and bounded-influence Simpson and Montgomery compound estimator.
Subsequently, resampling methods estimate the least squares prediction error for variable
selection. The Simpson and Montgomery filter removes the outlying observations at the
beginning of every one of the 100 replications. From Table 5.11, this scheme
successfully identifies the correct model with high probability for the proposed change in
prediction error criterion. The minimum prediction error criterion has a high probability
of correct model selection only for the bootstrap half sample.
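The filter-then-refit scheme can be sketched as below. The sketch takes the coefficients of a robust fit as an input rather than computing them, because the Simpson and Montgomery estimator is not implemented here; in the example the true coefficients merely stand in for a robust fit. Standardizing the residuals by a normal-consistent MAD scale is also an assumption of the sketch, as are the function name and the simulated data.

```python
import numpy as np

def filter_then_ols(X, y, robust_coef, cutoff=2.5):
    """Drop observations whose standardized residual from a robust fit exceeds
    the cutoff in absolute value, then refit by least squares on the rest."""
    resid = y - X @ robust_coef
    # Assumed scale estimate: MAD rescaled for consistency at the normal model.
    scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    keep = np.abs(resid / scale) <= cutoff
    coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return coef, keep

# Illustrative data: 40 observations, the last four shifted by 10 in the response.
rng = np.random.default_rng(0)
beta = np.array([2.0, 4.0, 8.0])
X = np.column_stack([np.ones(40), rng.standard_normal((40, 2))])
y = X @ beta + rng.standard_normal(40)
y[-4:] += 10.0
coef, keep = filter_then_ols(X, y, robust_coef=beta)  # beta stands in for a robust fit
print(keep[-4:], coef)
```

Any resampling estimate of prediction error can then be computed on the retained observations with ordinary least squares, which is much cheaper than refitting a compound estimator in every resample.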


Table 5.11. Results for 100 simulations for model selection from Shao (1996)
using the Simpson and Montgomery compound estimator to remove outliers
followed by least squares estimates. Observations 8, 15, 28, and 39 are made
residual outliers. The true model [2, 0, 0, 4, 8] is shaded. The top number in each
cell is the proportion of 100 replicates that the model was selected using the
minimization of prediction error criterion and the bottom number is the proportion
with the change in prediction error criterion with constant 0.025. The values in
brackets are the proportion of 10,000 bootstrap samples that the model is selected.


Model 

parameters 


β0, β3, β4


β0, β1, β3, β4


β0, β1, β2, β3, β4


Cross- 

Val 

Lv  1  out 


0.000 

0.000 


0.000 

0.000 


Cross- 
Val 
K  =  6 


0.000 

0.000 


0.000 

0.000 


0.580 

0.950 


0.170 

0.010 


0.050 

0.040 


Adj 

Cross - 
Val  K=6 


0.000 

0.000 


0.000 

0.000 


The fit for the clean 36 observations has an R² of approximately 0.99, much like
the scenarios of Section 5.4. If the random error added to the response variable is
generated from variates of NID(0, 5²) instead of NID(0, 1), then the Simpson and
Montgomery compound estimator (and all other robust regression estimators) fails to
provide meaningful parameter estimates. The parameter estimates vary widely between
the simulation replicates and significantly within the bootstrap samples. None of the
resampling methods or criteria reliably identifies the specified model.

5.6.3 Compound Estimator Resampling Methods for a Noisy, High-Dimension Data
Set with Multiple Outliers

To investigate the performance of the resampling methods in the presence of
outliers and noisy data, we generate an artificial data set. The response is generated from
y = Xβ + ε, where X is a 40 × 9 matrix of standard normal variates augmented with a
column of ones, the known vector of parameters β' is [2, 9, 6, 4, 8, 0, 0, 0, 0, 0], and ε is
NID(0, σ²I) with σ = 5. The last five observations are 10σ residual outliers and the last
three observations are also 10σ outliers in X-space for variables x3 through x6.

In contrast to all previous findings for the minimum prediction error criterion, the
bootstrap half sample results in Table 5.12 do not improve upon those from the full
sample. The change in prediction error criterion is successful for all methods except the
bootstrap half sample. The minimum prediction error criterion does not perform well
with cross-validation.

Table 5.12. Results for 50 simulation replicates for model selection with the Simpson
and Montgomery estimator. All regressor variable values are generated from a standard
normal distribution. The response is generated from the vector [2, 9, 6, 4, 8, 0, 0, 0, 0, 0]
and ε is NID(0, σ²I) with σ = 5. The last five observations are 10σ residual outliers and
the last three observations are also 10σ outliers in X-space for variables x3 through x6.
The top number in each cell is the proportion of 50 replicates that the model was selected
using the minimization of prediction error criterion and the bottom number is the change
in prediction error criterion with constant 0.01. The values in brackets are the proportions
of 5,000 bootstrap samples in which the model is selected. Results are accurate to
approximately ±0.06.


Model              Cross-val.   Cross-val.   Adj. cross-val.  Bootstrap full    Bootstrap half
parameters         leave-1-out  K = 6        K = 6            sample (n = 40)   sample (n = 20)

β0                 0.000        0.000        0.000            0.000 [0.004]     0.000 [0.037]
                   0.000        0.000        0.000            0.000 [0.001]     0.000 [0.000]

β0, β4             0.000        0.000        0.000            0.000 [0.016]     0.000 [0.000]
                   0.000        0.000        0.000            0.000 [0.008]     0.000 [0.004]

β0, β3, β4         0.000        0.000        0.000            0.000 [0.022]     0.000 [0.000]
                   0.000        0.000        0.020            0.000 [0.009]     0.000 [0.003]

β0, β1, β3, β4     0.000        0.000        0.000            0.000 [0.035]     0.060 [0.029]
                   0.000        0.000        0.000            0.000 [0.031]     0.000 [0.016]

β0-β4              0.540        0.360        0.420            0.780 [0.340]     0.720 [0.418]
                   0.820        0.820        0.840            0.900 [0.549]     0.560 [0.382]

β0-β5              0.180        0.380        0.300            0.180 [0.215]     0.220 [0.219]
                   0.060        0.020        0.020            0.040 [0.061]     0.100 [0.068]

β0-β6              0.040        0.100        0.100            0.020 [0.138]     0.000 [0.111]
                   0.020        0.060        0.040            0.000 [0.073]     0.060 [0.112]

β0-β7              0.100        0.060        0.040            0.020 [0.091]     0.000 [0.082]
                   0.040        0.040        0.020            0.060 [0.101]     0.220 [0.187]

β0-β8              0.060        0.000        0.000            0.000 [0.078]     0.000 [0.056]
                   0.000        0.020        0.020            0.000 [0.091]     0.040 [0.131]

β0-β9              0.080        0.100        0.140            0.000 [0.062]     0.000 [0.048]
                   0.060        0.040        0.040            0.000 [0.076]     0.020 [0.098]

5.6.4  A  Designed  Experiment  for  Resampling  Methods  with  Compound 
Estimators 

Based on the modified Gunst and Mason data and the artificial data set in Section
5.6.3, it appears that resampling methods are appropriate for variable selection when
outliers are present. To gain a better understanding of the performance of resampling
methods for the variable selection problem with multiple outliers, we run a designed
experiment using Monte Carlo simulation. The experiment varies characteristics of not
only the data set, but also of the resampling method to quantify the expected performance
of the various techniques. We use the Simpson and Montgomery compound estimator for
all simulations. Note that we do not first remove the outliers from the analysis, as
recommended by Davison and Hinkley (1997) and explored in Table 5.11.

5.6.4.1  Planning  the  Simulation  Experiment 

All data sets consist of n = 40 observations and p = 5 parameters. The response
vector is generated as y = Xβ + ε, where X is the design matrix of i.i.d. random variates
from the standard normal distribution, β' is the vector of known parameters [2, 4, 8, 0, 0],
and ε is the vector of random error variates from a N(0, σ²I) distribution. For the last
four observations, a value of δ is added to each regressor variable value to create high-
leverage points. Residual outliers are created for the last four or eight observations
(depending on the factor setting) by adding δ to the expected response value.
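One way to generate such a data set is sketched below. The sketch rests on two assumptions that should be read as such: the design matrix is taken to be an intercept column plus four standard normal regressors (so that p = 5), and the response shift for the residual outliers is interpreted in error standard deviation units, that is, δ times σ. The function name and defaults are hypothetical.

```python
import numpy as np

def simulate_contaminated(n=40, pct_outliers=10, delta=5.0, sigma=1.0, seed=0):
    """One contaminated data set: leverage points in the last four rows and
    residual outliers in the last 4 (10%) or 8 (20%) rows."""
    rng = np.random.default_rng(seed)
    beta = np.array([2.0, 4.0, 8.0, 0.0, 0.0])

    Z = rng.standard_normal((n, 4))
    Z[-4:] += delta                         # shift every regressor: high leverage
    X = np.column_stack([np.ones(n), Z])    # assumed: intercept plus four regressors

    y = X @ beta + rng.normal(0.0, sigma, size=n)
    n_out = round(n * pct_outliers / 100)
    # Assumed interpretation: outliers sit delta * sigma above the expected response.
    y[-n_out:] = X[-n_out:] @ beta + delta * sigma
    return X, y, beta

X, y, beta = simulate_contaminated(pct_outliers=20, delta=10.0, sigma=5.0)
print(X.shape, (y - X @ beta)[-8:])
```

With 20% contamination, only the last four of the eight residual outliers are also leverage points, which matches the description of the factor settings.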

Factors for the Experiment. From the previous results, pilot studies, and
knowledge of compound estimators, the following factors are included:

•  Percentage of outliers contaminating the sample. This could be an important
factor because resampling methods could form samples with too many outliers
that cause the estimator to break down. Also, prediction error is higher with
more outliers. The outlier density levels are 10% and 20%.

•  Outlying distance, δ. This measures how many standard deviations away to place
the outliers, in both leverage and residual. Larger values could lead to greater
prediction error. The levels are 5 standard deviations and 10 standard
deviations.

•  Signal-to-noise ratio, measured through σ_ε. Section 5.5 demonstrates that
this is a critical factor in determining the success of a procedure. The
probability of correctly selecting the model increases with the
signal-to-noise ratio. The levels for σ_ε are 1 and 5, which correspond to
approximate R² values of 0.98 and 0.80, respectively, for an OLS fit on the
uncontaminated portion of the data.

•  Bootstrap sample size. Shao (1996) demonstrates that this is the single most
important factor to correctly identify the active model parameters. Sections
5.4 and 5.5 also indicate the half sample size is preferred. However, there
may not be an appreciable difference for contaminated data sets (see Table
5.12). The levels again are the full sample (n = 40) and the half sample (n =
20).

•  Number of bootstraps per replication. Up to this point, we have followed
Davison and Hinkley's (1997) recommendation to use 100 bootstrap samples
as an absolute minimum. Breiman and Spector (1992) and Breiman (1996) do
not exceed 50 bootstrap samples and conclude for some applications that as
few as 5 may suffice. Clearly, fewer than 100 bootstraps would be preferred
when resampling with a computationally intensive compound estimator. The
levels in the experiment are B = 25 and 100 bootstrap samples.

•  Size of assessment set in cross-validation. This factor replaces the previous
two bootstrap-specific factors in the cross-validation runs. The purpose of this
factor is to determine if there is a significant difference between K-fold and
leave-one-out cross-validation procedures. The levels are 6 (K-fold) and 1
(leave-one-out).
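For the last factor, the two cross-validation variants differ only in how the observations are partitioned into assessment sets. The sketch below is a generic illustration that uses least squares for brevity (the experiment itself uses a compound estimator) and parameterizes the procedure by the number of folds, with K = 6 for K-fold and K = n for leave-one-out; the function name and the simulated data are assumptions of the example.

```python
import numpy as np

def cv_prediction_error(X, y, n_folds, seed=0):
    """Average squared prediction error over held-out assessment sets."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    sq_err = 0.0
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)             # construction set
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        sq_err += np.sum((y[fold] - X[fold] @ coef) ** 2)
    return sq_err / n

# Illustrative data: two active regressors with unit error variance.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(40), rng.standard_normal((40, 2))])
y = X @ np.array([2.0, 4.0, 8.0]) + rng.standard_normal(40)

pe_kfold = cv_prediction_error(X, y, n_folds=6)          # assessment sets of 6 or 7
pe_loo = cv_prediction_error(X, y, n_folds=len(y))       # assessment sets of 1
pe_null = cv_prediction_error(X[:, :1], y, n_folds=6)    # intercept-only model
print(pe_kfold, pe_loo, pe_null)
```

Leave-one-out requires n fits per candidate model against 6 for K-fold, a difference that matters when each fit is a compound estimator.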

Experimental Design and Response. All five factors for the bootstrap have only
two levels; therefore, an attractive screening design for this experiment is a 2^(5-1)
fractional factorial design of resolution V. This design can estimate the main effects and
the two-factor interactions free from aliasing. The cross-validation design is a full
factorial 2^4. The response value again is
the proportion of replicates in which the various parameter models are selected. We also
investigate the usefulness of weighting the squared residuals by the final weights from
the Simpson and Montgomery estimator. The motivation for this additional response
comes from Ronchetti and Staudte's (1994) robust Cp criterion and from pilot
experiments that showed a significant amount of the prediction error could be attributed
to the large residuals of the planted outliers.
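A half fraction of a five-factor, two-level design with this property can be constructed and checked as follows. The generator E = ABCD (defining relation I = ABCDE) is the standard choice for a resolution V half fraction and is assumed here; the text does not state which generator was used, and the factor labels A through E are placeholders.

```python
from itertools import combinations, product
import numpy as np

def half_fraction_2_5_1():
    """16-run two-level design in coded units (-1, +1) with generator E = ABCD."""
    base = np.array(list(product([-1, 1], repeat=4)))   # full factorial in A, B, C, D
    e = base.prod(axis=1, keepdims=True)                # E = ABCD
    return np.hstack([base, e])

design = half_fraction_2_5_1()

# Columns for the 5 main effects and the 10 two-factor interactions.
cols = [design[:, i] for i in range(5)]
cols += [design[:, i] * design[:, j] for i, j in combinations(range(5), 2)]
M = np.column_stack(cols)

# Mutually orthogonal columns mean none of these 15 effects is aliased with another.
print(design.shape, np.array_equal(M.T @ M, 16 * np.eye(15)))
```

The 16 runs leave no degrees of freedom for error once the mean, five main effects, and ten two-factor interactions are fitted, so significance must be judged from the replicated simulation responses or by pooling small effects.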

Pilot studies are necessary to select an appropriate value for the constant for the
change in prediction error criterion. The three factors that affect the choice of the
constant are the outlying distance, signal-to-noise ratio and weighting of the residuals.
Table 5.13 gives the values for the constants used in the simulations.

Table 5.13. Values of constants used in simulations
for the change in prediction error criterion.

δ      σ_ε    Constant for    Constant for
              unweighted      weighted

5      1      0.0250          0.0250
10     1      0.0250          0.0100
5      5      0.0050          0.0010
10     5      0.0025          0.0005

5.6.4.2  Simulation  Results 

Bootstrap Methods. The most striking aspect of the probabilities in Table 5.14
is the contrast between the left half and the right half of the table. This corresponds to the
difference between prediction errors computed with the unweighted versus weighted
residuals. Weighting the residuals leads to nearly perfect selection of the correct model
using the change in prediction error criterion, independent of any other factor setting.
Conversely, this weighting scheme almost always incorrectly selects the largest model if
the minimum prediction error criterion is used. This suggests some modifications could
be made to the prediction error calculation under a weighting scheme if the minimum
prediction error criterion is used. The modified calculation could account for the number
of model parameters, similar to the robust Cp.
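The effect of the weighting can be seen in a small sketch. Only the multiplication of squared residuals by the final robust weights comes from the text; averaging over all observations is an assumption of the sketch, as are the function name and the toy numbers.

```python
import numpy as np

def prediction_error(y, y_hat, weights=None):
    """Mean squared residual, optionally with each squared residual multiplied
    by the final weight of its observation from a robust fit."""
    resid_sq = (np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)) ** 2
    if weights is None:
        return resid_sq.mean()
    return (np.asarray(weights, dtype=float) * resid_sq).mean()

# Toy example: two well-fit observations and one outlier given weight zero.
y = [1.0, 2.0, 13.0]
y_hat = [0.0, 3.0, 3.0]
print(prediction_error(y, y_hat))                     # 34.0, dominated by the outlier
print(prediction_error(y, y_hat, weights=[1, 1, 0]))  # about 0.667
```

Because the weighted quantity has no term for model size, a penalty in the number of parameters would have to be added separately, in the spirit of the robust Cp.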

If the residuals are not weighted (the left half of Table 5.14), the most important
factor under either selection criterion is the amount of noise used to generate the response
values. The correct model is selected with virtual certainty if the signal-to-noise ratio is
high (σ_ε = 1) under the change in prediction error criterion. In fact, this is the only
significant factor from the ANOVA (R² = 0.85) for unweighted residuals with the change
in prediction error criterion. For the unweighted residuals with the minimum prediction error
criterion, the significant effects from ANOVA (R² = 0.87) are the signal-to-noise ratio, the
bootstrap sample size, and the number of bootstrap samples. Better model selection
occurs with smaller bootstrap sample sizes, a larger number of bootstrap samples and,
surprisingly, a lower signal-to-noise ratio (σ_ε = 5). The change in prediction error criterion
significantly outperforms the minimum prediction error criterion with unweighted residuals
in high signal-to-noise scenarios and is moderately outperformed in the low signal-to-noise
scenarios.

Table 5.14. 2^(5-1) design and results for bootstrap methods using compound estimators.

The top values in each cell are the proportion of times that the model is selected out of 50
replications using the minimum prediction error criterion. The bottom values are the
corresponding proportions using the change in prediction error criterion.

Design columns: % out, δ, σ_ε, Boot size (n-m), Num boot.
Result columns (each cell holds a top and a bottom value): Boot β0-β1, Boot β0-β2,
Boot β0-β3, Boot β0-β4, Boot wt β0-β1, Boot wt β0-β2, Boot wt β0-β3, Boot wt β0-β4.
Each row lists the design settings, then the cell values in reading order (top value, then
bottom value, cell by cell). A ? marks an illegible entry.

Design settings    Cell values
10 5 100         : 0.000 0.000 0.740 0.980 0.000 0.000 0.000 0.000 ?
20 ? 1 20 25     : 0.000 0.000 0.680 0.960 0.260 0.020 0.060 0.020 0.000 0.000 ? ?
10 ? 1 20 25     : 0.000 0.000 0.540 0.960 0.000 0.000 0.000 1.000 ? ? ?
20 10 1 20 100   : 0.000 0.000 ? ? 0.000 0.000 0.000 0.000 ? 0.000 0.000
10 5 5 20 25     : 0.000 0.000 0.760 0.600 0.180 0.120 0.060 0.280 0.000 0.000 0.000 1.000 ? 1.000 0.000
20 5 5 20 100    : 0.000 0.000 0.880 0.780 0.080 0.100 0.040 0.120 0.000 0.000 0.000 0.940 0.000 0.040 1.000 0.020
10 10 5 20 100   : 0.000 0.000 0.820 0.720 0.140 0.120 0.040 0.160 0.000 0.000 0.000 0.100 0.000 0.000 1.000 0.000
20 10 5 20 25    : 0.000 0.000 0.740 0.620 0.000 0.000 0.180 0.960 0.100 0.000 0.720 0.040
5 ? ? 25         : 0.000 0.000 0.440 1.000 0.340 0.000 0.220 0.000 0.000 0.000 0.000 1.000 0.020 0.000 0.980 0.000
20 5 1 ? 100     : 0.000 0.000 0.620 1.000 0.320 0.000 0.060 0.000 0.000 0.000 0.000 1.000 1.000 0.000
10 10 ? 40 ?     : 0.000 0.000 0.460 1.000 ? 0.000 0.000 0.000 1.000 1.000 0.000
10 ? 40 ?        : 0.000 0.000 0.320 1.000 0.000 0.000 0.800 0.000
10 ? 5 40 100    : 0.000 0.000 0.680 0.720 ? 0.000 0.000 0.000 1.000 1.000 0.000
20 ? ? 40 25     : 0.000 0.000 0.620 0.680 0.300 0.220 0.000 0.000 0.040 0.920 0.100 0.040
10 10 ? 40 25    : 0.620 0.440 0.060 0.260 ? 0.000 1.000 0.020 0.000 0.980 0.000
20 10 5 40 100   : 0.000 0.000 0.700 0.660 ? ? 0.000 0.000 0.020 1.000 ?



Cross-Validation Methods. The difference between using weighted and unweighted
residuals is not as distinct with cross-validation methods as it is for the bootstrap. The
minimum prediction error criterion using weighted residuals no longer selects the largest
parameter model exclusively. It selects the correct model, independent of factor settings,
between 40 and 50% of the time. The change in prediction error criterion with weighted
residuals selects the correct model with very high probability if the signal-to-noise ratio is
high; in lower signal-to-noise scenarios, it has about a 70% correct selection rate.
Signal-to-noise ratio is the only significant variable from ANOVA (R² = 0.85)
for the change in prediction error criterion using weighted residuals.

If the residuals are not weighted, then the minimum prediction error criterion is
still not effective; correct model selection probabilities are between 0.2 and 0.5. All four
factors are significant for this criterion from ANOVA (R² = 0.90). Signal-to-noise ratio
has the largest effect. The outlier magnitude, the signal-to-noise ratio, and their two-factor
interaction are the significant factors (R² = 0.95) for the change in prediction error
criterion with unweighted residuals. Performance of this criterion is similar to the
weighted residuals case: near perfect model selection if the signal-to-noise ratio is high
and about 70% otherwise (although with considerably more variance between factor settings).

Clearly, the best method across all scenarios is the change in prediction error
criterion applied to weighted prediction error from the bootstrap procedure. If the change
in prediction error criterion is used with unweighted residuals, then cross-validation gives
slightly better results than bootstrap procedures. For the minimum prediction error
criterion, cross-validation is not recommended. The best results are from the bootstrap
half sample. There does not seem to be much difference between K-Fold and
leave-one-out cross-validation procedures for either criterion.


Table 5.15. 2^4 design and results for cross-validation methods using compound
estimators. The top values in each cell are the proportion of times that the model is
selected out of 50 replications using the minimum prediction error criterion. The bottom
values are the corresponding proportions using the change in prediction error criterion.

5.7 Summary

This chapter proposes a criterion for model selection as an alternative to the strict
minimization of prediction error. A criterion that selects the model that has the fewest
variables and low prediction error is often a better choice. To implement and test this
procedure, an operational version is introduced that increases the dimension of the model
until the change in prediction error is less than a specified percentage of the total prediction
error in the intercept-only model. Extensive Monte Carlo simulation suggests this
criterion often outperforms the minimum prediction error criterion in both contaminated
and uncontaminated samples. The criterion is tested using prediction error estimates
from the leave-one-out cross-validation, K-Fold cross-validation, adjusted K-Fold
cross-validation, bias adjusted bootstrap, and bootstrap half sample procedures.
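
The operational rule described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the simulation code: it assumes the candidate models are nested and ordered by size, that `pred_errors[j]` holds a resampling estimate of prediction error for the model with j regressors (index 0 is the intercept-only model), and that the search stops at the first step whose improvement falls below the threshold. The function names and the example numbers are hypothetical.

```python
def select_by_change_in_pe(pred_errors, constant):
    """Return the index of the model chosen by the change in prediction error rule.

    pred_errors[j] is the estimated prediction error of the model with j
    regressors; pred_errors[0] is the intercept-only model (assumed layout).
    constant is the fraction of the intercept-only prediction error that a
    step must gain to justify the larger model (for example 0.025).
    """
    threshold = constant * pred_errors[0]
    for j in range(1, len(pred_errors)):
        improvement = pred_errors[j - 1] - pred_errors[j]
        if improvement < threshold:
            return j - 1  # the extra regressor did not gain enough
    return len(pred_errors) - 1  # every step cleared the threshold


def select_by_minimum_pe(pred_errors):
    """Return the index of the model with the smallest estimated prediction error."""
    return min(range(len(pred_errors)), key=lambda j: pred_errors[j])


# Hypothetical prediction error estimates for models with 0..4 regressors.
pe = [100.0, 40.0, 10.0, 9.5, 9.2]
chosen_change = select_by_change_in_pe(pe, 0.025)
chosen_min = select_by_minimum_pe(pe)
```

With these illustrative numbers the change rule stops at two regressors, while the minimum rule picks the largest model because its estimated error is marginally smaller.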

5.7.1  Summary  of  Results  for  Least  Squares  Estimation 

• For the Shao (1996) scenarios, the proposed criterion has nearly a 100%
correct model selection rate for all resampling procedures because the
signal-to-noise ratio is very high (R² = 0.99) and there are only 4 regressor variables.
Only the bootstrap half sample method is consistently above 80% using the
minimum prediction error criterion.

• If the Shao (1996) data set is extended to 9 regressor variables, then the
proposed criterion exceeds a 91% correct selection rate for all five resampling
methods. The minimum prediction error criterion selection rate is below
70% for all procedures except the bootstrap half sample (89%).

• If the signal-to-noise ratio is decreased (R² = 0.70) in the 9 regressor variable
model, then the proposed criterion correct selection rate is approximately 75%
for all methods except the bootstrap half sample (84%). The minimum
prediction error criterion selection rate is below 63% for all methods except
the bootstrap half sample (87%). This shows that methods other than the
bootstrap half sample are competitive when the proposed criterion is used.

5.7.2 Summary of Results for Compound Estimation

• If 10% residual outliers are planted in the Gunst and Mason data set (already
contaminated with high-leverage values), then all methods and criteria fail to
identify the correct model with least squares. If the least squares estimator is
replaced with the Simpson & Montgomery compound estimator, then the
proposed criterion selects the correct model over 93% of the time for all
resampling methods. The minimum prediction error criterion has below a
50% correct selection rate except for the bootstrap half sample procedure
(93%).

• If the number of regressors increases to 9 in the modified Gunst and Mason
data and the signal-to-noise ratio decreases (R² = 0.80), then the proposed
criterion selects the correct model over 80% of the time for cross-validation
and 90% for the bootstrap. The minimum prediction error criterion selection
rate is below 80% for the bootstrap and below 55% for cross-validation. Most
importantly, the bootstrap half sample is worse than the full sample.




• A designed experiment investigating the effect of outlier density, outlier
magnitude, signal-to-noise ratio, and sample sizes for the resampling methods
demonstrates that the proposed criterion is preferable or comparable to the
minimum prediction error criterion. If the residuals are weighted by the final
weights from the compound estimator, the correct model is almost always
selected with the proposed criterion for the bootstrap methods. However, in
the same scenarios, the minimum prediction error criterion always overfits
using weighted residuals. For unweighted residuals, the proposed criterion is
often preferred and always competitive for all scenarios.


Chapter  6 

Summary,  Contributions,  and  Future  Research 


6.1  Introduction 

This research uses extensive Monte Carlo simulation to evaluate several aspects
of the multiple outlier problem in regression. Chapter 1 demonstrates the impact that
multiple outliers can have on a regression model, the failure of standard OLS diagnostic
measures to detect the outliers, and the trouble outliers can cause in the variable selection
process. The stated objectives of this research are to comprehensively test the leading
multiple outlier detection procedures, improve existing methods that identify and
accommodate outliers, and investigate the usefulness of resampling methods for variable
selection in regression models with multiple outliers. These three objectives are
addressed in Chapters 3-5, respectively. This chapter provides a summary of the major
findings for each objective, the original contributions, and recommendations for future
research.

6.2 Comparative Analysis of Multiple Outlier Detection Procedures

The objective is to conduct a comprehensive performance study of numerous
multiple outlier detection methods proposed in the literature. The methods are tested in
realistic and challenging regression scenarios to establish the candidates’ strengths and
weaknesses.

6.2.1 Summary of Significant Findings

The single most important factor affecting the performance of all methods is the
leverage of the outlying observations. The significant results are reported separately for
high-leverage (exterior X-space) and low-leverage (interior X-space) outliers. Many
procedures have not previously been tested with high-leverage outliers.

Low-leverage outliers. All of the selected methods (except Peña and Yohai)
perform well for low-leverage outliers once the outlying distance exceeds 5σ from the
regression surface. OLS generally detects the outliers, but suffers from significant false
alarms as the magnitude of the outlying distance increases. The indirect procedures
dominate the direct methods with one notable exception. The Sebert et al. clustering
methodology is in many cases the best method; however, the false alarm rate can be high
and some scenarios defeat the method. Overall, the high-breakdown point (HBP)
estimators are recommended; in particular, the MM estimator. For all procedures, the
factor with the greatest impact, apart from leverage, is outlying distance followed by
outlier density and dimension, respectively.

High-leverage outliers. The HBP estimators that are successful in the
low-leverage scenarios perform poorly if the outliers are also remote in X-space. Most direct
procedures lose a significant amount of detection capability with the high-leverage points
because the algorithms rely on least squares residuals. The compound robust regression
estimators are generally preferred to the direct algorithms. The Simpson & Montgomery
compound estimator has the best overall performance. Also, the Rousseeuw and van
Zomeren method using simulated cutoff values is powerful. This suggests that the newer
MVE and LMS algorithms are not plagued as much by the criticisms of the random
sampling schemes. For all methods, the most significant factors affecting performance
are the leverage, the residual magnitude, and their two-factor interaction.

6.2.2  Contributions 

Several  multiple  outlier  detection  procedures  have  been  proposed  in  recent  years. 
All  demonstrate  good  results  in  the  authors’  limited  studies  that  are  often  restricted  to 
“classic  data  sets”  or  low-dimension,  low-leverage  examples.  There  has  not  been  a 
comprehensive  evaluation  of  procedures  since  1990.  Every  method  tested  in  this 
research  has  been  proposed  since  1990.  The  contributions  are: 

•  A  direct  comparison  of  the  current  multiple  outlier  detection  methods. 

•  Sensitivity  analysis  of  all  procedures  to  outlier  magnitude,  density,  leverage, 
and  configuration  in  X-space. 

•  The  recommendation  that  robust  regression  estimators  are  in  most  cases 
superior  to  the  direct  methods.  It  may  be  of  little  use  to  integrate  one  of  the 
specialized  direct  methods  into  a  suite  of  regression  analysis  tools.  Robust 
regression  capability  is  all  that  is  required. 

6.2.3  Future  Research 

Monte Carlo simulation is the method used to evaluate performance in the
selected outlier scenarios. These scenarios are limited to mean shift outliers and typically
multiple point clouds. Performance studies with other approaches to data generation
would be useful to further test the procedures. The results from this research could be
used to screen the multiple outlier detection methods, and alternative outlier scenarios
could be run (e.g. Breiman and Spector, 1992; Rocke and Woodruff, 1996; Wilcox,
1996a).

The research in Chapter 3 shows that in general the robust regression estimators
outperform the direct methods. As such, Chapter 4 explored ways to improve the
compound estimators. It is likely that some of the direct methods could be improved by
integrating a robust estimator into the process. For example, most direct methods suffer
significant loss in power for the high-leverage scenarios. This often can be traced back to
the method depending on some form of the least squares residual to drive the algorithm.

Two recent multiple outlier detection methods (Lee and Fung, 1997, and Luceño,
1998) address the generalized linear model (GLIM). There are no results in the literature
that compare detection methods for the GLIM. Furthermore, many of the concepts for
the direct identification methods and robust estimators from this research could be
applied to the GLIM. Improved methods could be proposed for the GLIM.

6.3  An  Improved  Compound  Estimator 

The second research objective is to use the results from the performance study in
Chapter 3 to improve upon an existing technique. The mechanics of compound
estimators are evaluated more closely because of their favorable performance with
high-leverage outliers in the comparative analysis. The leading compound estimators are
vulnerable in high-dimension, high-leverage and high-density scenarios. Two
characteristics of the compound estimators in these scenarios require closer
scrutiny: the π-weights are not unusual for the high-leverage outliers, and the final
parameter estimates do not differ significantly from the initial estimates.

6.3.1  Summary  of  Significant  Findings 

Performance study on measures of leverage. This Monte Carlo simulation study
compares the Mahalanobis distance (hat diagonal), MVE, MCD, Hadi sequential point
addition algorithm, Sebert et al. clustering methodology, M-estimates of covariance, and
Rocke and Woodruff hybrid algorithm for identifying remote observations in X-space.
Mahalanobis distance breaks down in nearly all tested scenarios and the M-estimates of
covariance perform only slightly better. The Hadi algorithm can be tuned for excellent
performance except in high dimension. The MVE and MCD have comparable
performance to one another; the MCD demonstrates slightly better results overall. The
Sebert et al. method performs well, but can be vulnerable when the predicted response
values are not Y-space outliers. Overall, the Rocke and Woodruff method demonstrates
the best results for detection capability and resistance to false alarms.
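
The pairing of Mahalanobis distance with the hat diagonal in the list above reflects a standard identity for a regression model with an intercept. The notation is introduced here for illustration and is not taken from the earlier chapters: h_ii is the i-th hat diagonal, MD_i is the classical Mahalanobis distance of x_i from the sample mean, and S is the sample covariance matrix of the regressors with divisor n - 1.

```latex
h_{ii} = \frac{1}{n} + \frac{MD_i^{2}}{n-1},
\qquad
MD_i^{2} = (x_i - \bar{x})^{T} S^{-1} (x_i - \bar{x})
```

Both quantities depend on the classical mean and covariance, which the outliers themselves can distort; this is consistent with the breakdown of Mahalanobis distance reported above.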

A new measure of leverage in published compound estimators. Incorporation of
the Rocke and Woodruff robust distances in the Coakley and Hettmansperger and
Simpson and Montgomery compound estimators does not improve the performance in the
vulnerable scenarios. The final weights are slightly unusual for the outliers if the new
leverage measure is used. However, if the number of iterations of IRLS is increased to 3
or 4, then the outliers are properly assigned large residual values and the regression
surface is not pulled toward the outliers.

Initial estimator study. In many high-leverage scenarios, the high-breakdown
estimator provides poor estimates in the first stage of a compound estimator.
High-breakdown point estimators do not have bounded influence. An initial estimator is
proposed that removes only the high-leverage and the high-residual observations from the
sample, rather than 50% of the observations as the common high-breakdown estimators
often do. High-leverage points are removed from the sample if the Rocke and Woodruff
robust distance values exceed the cutoff value. Next, the residual outliers are removed if
the standardized residual from an MM fit exceeds approximately 2.0. Lastly, an OLS fit
on the remaining observations provides the parameter estimates. This is an efficient,
high-breakdown, and bounded-influence initial estimator. Testing indicates that this
estimator is highly successful not only in the high-density, high-dimension and
high-leverage scenarios, but also in all other outlier configurations.
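
The three screening steps of the initial estimator can be sketched as follows. This is only an illustration under stated assumptions: the robust distances and the standardized MM residuals are taken as precomputed inputs (computing them is outside the sketch), a single regressor is used so that the OLS step fits in a few lines, and all function names, cutoffs other than 2.0, and data values are hypothetical.

```python
def ols_simple(x, y):
    """Ordinary least squares for one regressor; returns (intercept, slope)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def initial_estimate(x, y, robust_dist, dist_cutoff, mm_std_resid, resid_cutoff=2.0):
    """Screen out leverage points, then residual outliers, then fit OLS."""
    # Step 1: drop cases whose robust distance exceeds the cutoff.
    keep = [i for i in range(len(x)) if robust_dist[i] <= dist_cutoff]
    # Step 2: drop cases whose standardized residual is too large in magnitude.
    keep = [i for i in keep if abs(mm_std_resid[i]) <= resid_cutoff]
    # Step 3: OLS on the remaining cases gives the initial estimates.
    intercept, slope = ols_simple([x[i] for i in keep], [y[i] for i in keep])
    return intercept, slope, keep


# Hypothetical data: five clean cases on y = 5x, one residual outlier,
# and one high-leverage outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 3.0, 20.0]
y = [5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 40.0]
robust_dist = [1.2, 0.6, 0.1, 0.6, 1.2, 0.1, 9.0]
mm_std_resid = [0.1, -0.2, 0.0, 0.2, -0.1, 6.0, -20.0]

intercept, slope, kept = initial_estimate(x, y, robust_dist, 3.0, mm_std_resid)
```

Only the flagged cases are discarded, so the final OLS step uses every remaining observation rather than a half sample.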

Proposed compound estimator. The proposed compound estimator uses the new
initial estimator and also the improved Rocke and Woodruff robust distances for the
π-weight component. It significantly expands the effective region of operability for
compound estimation with respect to outlying distance in both leverage and residual.
Also, the estimator performs well in a published comparative analysis of robust
regression estimators (Simpson and Montgomery, 1998b) where the leverage distances
are not as challenging.

6.3.2 Contributions

• A comprehensive performance study for measures of leverage.
Published results are limited to certain estimators and specific scenarios.
There are no results on the performance of the MVE and MCD with the
increased efficiency algorithms; Simpson and Chang (1997) call for such
a study.

• An efficient, bounded-influence, and high-breakdown initial estimator.
Existing initial estimators are high-breakdown only and may not provide
useful parameter estimates in high-leverage scenarios. A good initial
estimate is essential to a compound estimator because the final
parameters may not change much and the final scale estimate is often
based on the initial estimator’s residuals.

• An improved compound estimator. The proposed estimator expands the
area of coverage in high dimension. Hampel (1997) states that a major
gap in robust statistics is the lack of results and available tools for high
dimension.

6.3.3  Future  Research 

The proposed initial and compound estimators could be used as indirect methods
for multiple outlier detection. Pilot studies show that these methods detect the planted
outliers in the scenarios of Chapter 3 where all other methods fail. Additional finite
sample performance studies are needed, especially in high dimension. An improvement
to the Rousseeuw and van Zomeren plot of robust distances from the MVE against
standardized LMS residuals is possible by replacing these measures with the Rocke and
Woodruff robust distances and the proposed compound estimator’s standardized residuals.
Possibly some clustering of these components akin to Sebert et al. could be useful for
outlier identification.

A more critical evaluation of the components of the compound estimators could
be beneficial. Specifically, studies could be done on the best way to form the
π-weights from the Rocke and Woodruff robust distances. Also, this research did not
consider the impact of changing the ψ function and estimates of scale. Another
opportunity for improvement is to follow the Simpson and Chang (1997)
recommendation to use a Hill-Ryan GM objective function rather than Schweppe or
Mallows.

6.4  Resampling  Methods  for  Variable  Selection 

The  last  research  objective  is  to  determine  the  appropriateness  of  resampling 
methods  for  variable  selection  in  the  presence  of  multiple  outliers.  Resampling  methods 
with  cross-validation  and  bootstrap  estimates  of  model  prediction  error  are  currently  the 
preferred  approach  to  variable  selection  in  OLS.  Their  major  drawback  is  that  they  are 
computationally  intense.  Robust  regression  estimators  are  also  computationally  intense. 
With  computational  power  increasing  at  dramatic  rates,  it  will  not  be  long  before  using 
resampling  methods  with  robust  regression  is  a  viable  approach  for  the  practitioner.  This 
research explored combining these two classes of procedures.

6.4.1 Summary of Significant Findings

An alternative variable selection criterion for OLS. All of the proposed
regression variable selection procedures with resampling methods suggest that the best
model is the one with the minimum prediction error. This research proposes a more
realistic criterion that selects the model with the fewest parameters and a low (not
necessarily minimum) prediction error. The scenarios in Shao (1996) are rerun using the
proposed criterion. The results indicate that the proposed criterion is superior to the
minimum prediction error criterion. Therefore, a bootstrap procedure using bootstrap
sample sizes of less than one-half the original sample size is not the only method to select
the appropriate size model. The proposed procedure also works well for cross-validation
procedures and the bootstrap using the full sample. This conclusion is still valid if the
dimension of the problem increases or if the R² value is lowered from 0.995 (in all of
Shao’s scenarios) to a more realistic value of 0.70.

Resampling methods with compound estimators. Resampling methods are
appropriate for compound estimators. The compound estimators identify the correct
model most of the time. Results are better with the proposed selection criterion than
with selection by minimum prediction error. A designed experiment tests the effects of
outlier density, outlier magnitude, signal-to-noise ratio and resampling method sample
sizes. The proposed criterion mostly outperforms or is competitive with the minimum
prediction error criterion. The signal-to-noise ratio is the most important factor for all
methods.

A large portion of the prediction error can be attributed to the large residual
values of the outliers. An estimate of prediction error is proposed that weights each
observation’s squared prediction error by the final weight from a compound estimator.
The results are dramatically different from the unweighted estimate of prediction error.
There is virtual assurance of selecting the correct model with the proposed criterion and
virtual assurance of selecting the largest parameter model with the minimum prediction
error criterion. These conclusions hold independent of outlier density, outlier magnitude,
or signal-to-noise ratio.
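
The weighting idea in the paragraph above can be illustrated with a short sketch. It assumes the final weights lie between 0 and 1 and that the weighted squared errors are averaged over the number of observations; the exact normalization used in the simulations is not restated here, so that choice, the function name, and the data values are all assumptions.

```python
def prediction_error(y, y_hat, weights=None):
    """Average (optionally weighted) squared prediction error.

    weights are assumed to be the final robust weights, one per observation;
    None gives the ordinary unweighted estimate.
    """
    if weights is None:
        weights = [1.0] * len(y)
    total = sum(w * (yi - fi) ** 2 for w, yi, fi in zip(weights, y, y_hat))
    return total / len(y)


# Hypothetical held-out responses and predictions; the last case is an outlier
# that the robust fit has downweighted.
y = [1.0, 2.0, 3.0, 20.0]
y_hat = [1.1, 1.9, 3.2, 4.0]
w = [1.0, 1.0, 1.0, 0.05]

pe_unweighted = prediction_error(y, y_hat)
pe_weighted = prediction_error(y, y_hat, w)
```

In this toy example a single outlier accounts for nearly all of the unweighted estimate, while the weighted estimate is driven mainly by the clean cases.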

6.4.2  Contributions 

•  An  improved  variable  selection  criterion  for  bootstrap  and  cross-validation 
estimates  of  prediction  error  in  OLS  regression. 

•  Reliable  variable  selection  is  possible  in  OLS  with  cross-validation  and 
bootstrap  methods  if  the  proposed  criterion  is  used. 

•  The  proposed  criterion  and  resampling  methods  are  recommended  for  variable 
selection  with  compound  estimators. 

• A weighted estimate of prediction error combined with the proposed criterion
is highly effective for variable selection. This method is also robust across a
variety of outlier scenarios.

6.4.3 Future Research

This research has demonstrated that resampling methods are appropriate for the
variable selection problem in robust regression. A finite sample performance study that
compares analytical variable selection procedures from the asymptotic estimates of the
covariance matrix to the resampling methods would be useful. Additionally, there are
other bootstrap methods proposed that may provide better results. One promising method
is the wild bootstrap (Mammen, 1992) that is appropriate for regression models with
heteroscedastic errors.
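
For readers unfamiliar with the wild bootstrap, the resampling step can be sketched as below. This is a generic illustration, not the scheme of the cited reference: it assumes random sign multipliers (a common simple choice; other two-point distributions are also used), and the function name and data are hypothetical.

```python
import random


def wild_bootstrap_responses(fitted, residuals, rng):
    """Build one wild bootstrap response vector.

    Each residual stays with its own observation and is multiplied by an
    independent random sign (an assumed choice of multiplier), so
    observation-specific error variance is preserved.
    """
    return [f + r * rng.choice((-1.0, 1.0)) for f, r in zip(fitted, residuals)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
fitted = [2.0, 4.0, 6.0, 8.0]
residuals = [0.1, -0.5, 2.0, -4.0]  # hypothetical residuals with growing spread
y_star = wild_bootstrap_responses(fitted, residuals, rng)
```

Because residuals are never moved between observations, a case with a large error variance keeps a large perturbation in every bootstrap sample.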

The proposed change in prediction error criterion for variable selection could be
improved. This criterion only captures some of the subjectivity in selecting a model and
is overly conservative. Improvements are possible by using some measure other than
percentage of null model prediction error. A goal programming approach is possible. An
opportunity exists to refine the weighted estimate of prediction error. Enhancements to
Ronchetti and Staudte’s (1994) robust Cp for resampling could also be considered.

MAKEDATA

# This script file contains the data generators for the experiments.
# All clean observations are multivariate normal with mean 7.5 and a
# standard deviation of 4.0. These constants are not important to the
# performance of the procedures. Some procedures require a column of
# ones for the constant term (Hadi and Simonoff, Swallow and Kianifard);
# this is handled internally by the procedures.
# For multiple point clouds, the regressor levels are perturbed slightly
# by uniform(0, 0.25) to keep them from being a single point mass.
# The responses are generated by multiplying 5.0 by the level of each
# predictor value and adding N(0,1) noise and the shift if the cases
# are regression outliers.

# The subroutine gendata generates the data set with up to two clouds.
# out1 is the number of outliers in the first cloud, outshft1 is the
# number of standard deviations to shift the data in X-space, yshift1
# is the number of standard deviations to shift the response, n is the
# number of observations, k is the number of regressors and x is the
# number of regressors that are outlying out of the k.
# This gendata lets clouds be outlying in fewer than p variables. The
# ones not outlying are random. This is not what gendata2/6 do; they
# put the other variables at the mean of 7.5. That configuration is
# significantly more difficult to detect.

gendata<-function(out1,out2,outshft1,outshft2,yshift1,yshift2,n,k,x)
{
outs<-out1+out2  # the total planted outliers
first<-n-outs+1  # observation number of first planted outlier
last<-n-outs  # observation number of the last clean case
kmx<-k-x  # number of regressors that are not outlying

shift1<-7.5 + outshft1*4  # place cloud 1 at this location
shift2<-7.5 + outshft2*4  # place cloud 2 at this location

# one<-rep(1,n)  # some procedures need an intercept

jin<-matrix(rnorm(last*k,7.5,4.0),ncol=k)  # predictors for clean cases
yin<-apply(5*jin,1,sum) + matrix(rnorm(last),ncol=1)
yin<-matrix(yin,ncol=1)  # clean response values

if (k == x)  # if outlying in all variables
{
j1<-matrix(shift1+runif(out1*k,0.0,0.25),ncol=k)  # cloud 1 x values
j2<-matrix(shift2+runif(out2*k,0.0,0.25),ncol=k)  # cloud 2 x values
}
else
{
jout1<-matrix(shift1+runif(out1*x,0.0,0.25),ncol=x)  # outlying subset in cloud
jin1<-matrix(rnorm(out1*kmx,7.5,4.0),ncol=kmx)  # inlying vars in cloud
j1<-cbind(jout1,jin1)

jout2<-matrix(shift2+runif(out2*x,0.0,0.25),ncol=x)
jin2<-matrix(rnorm(out2*kmx,7.5,4.0),ncol=kmx)
j2<-cbind(jout2,jin2)
}  #endelse

x<-rbind(jin,j1,j2)  # the x values
# x<-cbind(one,x)  # if you need the intercept
x<-as.matrix(x)

y1<-apply(5*j1,1,sum) + yshift1  # responses for the first cloud
y1<-matrix(y1,ncol=1)

y2<-apply(5*j2,1,sum) + yshift2  # responses for the second cloud
y2<-matrix(y2,ncol=1)
y<-rbind(yin,y1,y2)
y<-matrix(y,ncol=1)

# return(x,y) gives a list with components x and y in S-PLUS;
# the explicit list keeps the same result and also runs in R
return(list(x=x,y=y))
}



# This gendata is for n=40, k=2 and places outliers at a separate level in
# each of the two variables e.g. 2 sigma for x1 and -2 sigma for x2
# for up to two clouds. This is used to keep the responses for the
# outliers from being y space outliers.
gendata2<-function(out1,out2,outshft11,outshft12,outshft21,outshft22,yshift1,yshift2,n,k,x)
{
{
  outs<-out1+out2   # the total planted outliers
  first<-n-outs+1
  last<-n-outs
  shift11<-7.5 + outshft11*4
  shift12<-7.5 + outshft12*4
  shift21<-7.5 + outshft21*4
  shift22<-7.5 + outshft22*4
  # one<-rep(1,n)
  jin<-matrix(rnorm(last*2,7.5,4.0),ncol=2)
  yin<-apply(5*jin,1,sum) + matrix(rnorm(last),ncol=1)
  yin<-matrix(yin,ncol=1)
  jout11<-matrix(shift11+runif(out1,0.0,0.25),ncol=1)
  jout12<-matrix(shift12+runif(out1,0.0,0.25),ncol=1)
  j1<-cbind(jout11,jout12)
  jout21<-matrix(shift21+runif(out2,0.0,0.25),ncol=1)
  jout22<-matrix(shift22+runif(out2,0.0,0.25),ncol=1)
  j2<-cbind(jout21,jout22)
  x<-rbind(jin,j1,j2)   # the x values
  # x<-cbind(one,x)   # intercept
  x<-as.matrix(x)
  y1<-apply(5*j1,1,sum)+ yshift1
  y1<-matrix(y1,ncol=1)
  y2<-apply(5*j2,1,sum)+yshift2
  y2<-matrix(y2,ncol=1)
  y<-rbind(yin,y1,y2)
  y<-matrix(y,ncol=1)
}
return(x, y)
}


# The function gendata6 does the same as gendata2 except
# the specific level for each of the 6 variables can be set in each cloud.
gendata6<-function(out1,out2,outshft11,outshft12,outshft13,outshft14,
  outshft15,outshft16,outshft21,outshft22,outshft23,outshft24,outshft25,
  outshft26,yshift1,yshift2,n,k,x)
{
{
  outs<-out1+out2   # the total planted outliers
  first<-n-outs+1
  last<-n-outs
  shift11<-7.5 + outshft11*4
  shift12<-7.5 + outshft12*4
  shift13<-7.5 + outshft13*4
  shift14<-7.5 + outshft14*4
  shift15<-7.5 + outshft15*4
  shift16<-7.5 + outshft16*4
  shift21<-7.5 + outshft21*4
  shift22<-7.5 + outshft22*4
  shift23<-7.5 + outshft23*4
  shift24<-7.5 + outshft24*4
  shift25<-7.5 + outshft25*4
  shift26<-7.5 + outshft26*4
  # one<-rep(1,n)
  jin<-matrix(rnorm(last*6,7.5,4.0),ncol=6)
  yin<-apply(5*jin,1,sum) + matrix(rnorm(last),ncol=1)
  yin<-matrix(yin,ncol=1)
  jout11<-matrix(shift11+runif(out1,0.0,0.25),ncol=1)
  jout12<-matrix(shift12+runif(out1,0.0,0.25),ncol=1)
  jout13<-matrix(shift13+runif(out1,0.0,0.25),ncol=1)
  jout14<-matrix(shift14+runif(out1,0.0,0.25),ncol=1)
  jout15<-matrix(shift15+runif(out1,0.0,0.25),ncol=1)
  jout16<-matrix(shift16+runif(out1,0.0,0.25),ncol=1)
  j1<-cbind(jout11,jout12,jout13,jout14,jout15,jout16)
  jout21<-matrix(shift21+runif(out2,0.0,0.25),ncol=1)
  jout22<-matrix(shift22+runif(out2,0.0,0.25),ncol=1)
  jout23<-matrix(shift23+runif(out2,0.0,0.25),ncol=1)
  jout24<-matrix(shift24+runif(out2,0.0,0.25),ncol=1)
  jout25<-matrix(shift25+runif(out2,0.0,0.25),ncol=1)
  jout26<-matrix(shift26+runif(out2,0.0,0.25),ncol=1)
  j2<-cbind(jout21,jout22,jout23,jout24,jout25,jout26)
  x<-rbind(jin,j1,j2)   # the x values
  # x<-cbind(one,x)
  x<-as.matrix(x)
  y1<-apply(5*j1,1,sum)+ yshift1
  y1<-matrix(y1,ncol=1)
  y2<-apply(5*j2,1,sum)+yshift2
  y2<-matrix(y2,ncol=1)
  y<-rbind(yin,y1,y2)
  y<-matrix(y,ncol=1)
}
return(x, y)
}


#  This  data  generation  function  generates  a  single  outlying  cloud  at 

#  a  random  location  in  the  interior  of  X  space  found  by  using  the  median 

#  of  the  last  three  clean  observations  in  each  variable.  The  parameters 

# outshft1, outshft2, yshift2 and x are not used.
#
gendatamed1<-function(out1,out2,outshft1,outshft2,yshift1,yshift2,n,k,x)
{
{
  outs<-out1+out2
  last<-n-outs
  first<-n-outs+1
  lastm2<-last-2
  xin<-matrix(rnorm(last*k,7.5,4.0),ncol=k)
  xmed<-apply(xin[lastm2:last,],2,median)
  temp<-matrix(rnorm(outs*k,0,.05),ncol=k)
  xmedm<-xmed+temp
  xmedm<-matrix(xmedm,ncol=k,byrow=T)
  x<-rbind(xin,xmedm)
  yin<-apply(5*xin,1,sum) + matrix(rnorm(last),ncol=1)
  yout<-apply(5*xmedm,1,sum)+ matrix(rnorm(outs)+yshift1,ncol=1)
  y<-rbind(yin,yout)
}
return(x, y, xmed)
}

#
# The function gendatamed2 does the same thing as gendatamed1 except for 2
# clouds. The second cloud is located at the median of the first three
# clean observations.
#
gendatamed2<-function(out1,out2,outshft1,outshft2,yshift1,yshift2,n,k,x)
{
{
  outs<-out1+out2
  last<-n-outs
  first<-n-outs+1
  lastm2<-last-2
  # one<-rep(1,n)
  xin<-matrix(rnorm(last*k,7.5,4.0),ncol=k)
  xmed1<-apply(xin[lastm2:last,],2,median)
  temp<-matrix(rnorm(out1*k,0,.05),ncol=k)
  xmedm1<-xmed1+temp
  xmedm1<-matrix(xmedm1,ncol=k,byrow=T)
  xmed2<-apply(xin[1:3,],2,median)
  temp<-matrix(rnorm(out2*k,0,.05),ncol=k)
  xmedm2<-xmed2+temp
  xmedm2<-matrix(xmedm2,ncol=k,byrow=T)
  x<-rbind(xin,xmedm1,xmedm2)
  # x<-cbind(one,x)
  yin<-apply(5*xin,1,sum) + matrix(rnorm(last),ncol=1)
  yout1<-apply(5*xmedm1,1,sum)+matrix(rnorm(out1)+yshift1,ncol=1)
  yout2<-apply(5*xmedm2,1,sum)+matrix(rnorm(out2)+yshift2,ncol=1)
  y<-rbind(yin,yout1,yout2)
}
return(x, y, xmed1, xmed2)
}

# This function genrand generates regression outliers randomly
# in x-space. The parameters outshft1, outshft2, and yshift2 are
# not used. These outliers are not in multiple point clouds.
genrand<-function(out1,out2,outshft1,outshft2,yshift1,yshift2,n,k,x)
{
{
  outs<-out1+out2
  first<-n-outs+1
  last<-n-outs
  one<-rep(1,n)
  x<-matrix(rnorm(k*n,7.5,4.0),ncol=k)
  yin<-apply(5*x[1:last,],1,sum) + matrix(rnorm(last),ncol=1)
  yin<-matrix(yin,ncol=1)
  # x<-cbind(one,x)
  x<-as.matrix(x)
  yout<-apply(5*x[first:n,],1,sum)+ yshift1
  yout<-as.matrix(yout,ncol=1)
  y<-rbind(yin,yout)
  y<-matrix(y,ncol=1)
}
return(x, y)
}

#  The  function  genmix  generates  low  leverage  regression  outliers  (first 

#  outliers  specified)  at  random  locations  in  X-space  and  a  cloud 

#  of  high  leverage  regression  outliers  (second  outliers  specified) . 

#  The  outliers  may  be  unusual  in  any  number  of  variables  in  the  cloud. 

# The parameter outshft1 is not used since the first set of outliers are
# random.
#
genmix<-function(out1,out2,outshft1,outshft2,yshift1,yshift2,n,k,x)
{
{
  outs<-out1+out2
  kmx<-k-x
  first<-n-outs+1
  last<-n-outs
  firststop<-n-out2
  second<-firststop+1
  shift1<-7.5 + outshft1*4
  shift2<-7.5 + outshft2*4
  # one<-rep(1,n)
  xin<-matrix(rnorm(k*firststop,7.5,4.0),ncol=k)
  if (k!=x){
    xcloudout<-matrix(shift2+runif(out2*x,0.0,0.25),ncol=x)
    xcloudin<-matrix(rnorm(kmx*out2,7.5,4.0),ncol=kmx)
    xcloud<-as.matrix(cbind(xcloudout,xcloudin))
  }
  else {
    xcloud<-matrix(shift2+runif(out2*k,0.0,0.25),ncol=k)
  }
  x<-rbind(xin,xcloud)
  x<-as.matrix(x)
  # x<-as.matrix(cbind(one,x))
  yin<-apply(5*x[1:last,],1,sum) + matrix(rnorm(last),ncol=1)
  yin<-matrix(yin,ncol=1)
  yout1<-as.matrix(apply(5*x[first:firststop,],1,sum)+ yshift1,ncol=1)
  yout2<-as.matrix(apply(5*x[second:n,],1,sum)+ yshift2,ncol=1)
  y<-as.matrix(rbind(yin,yout1,yout2),ncol=1)
}
return(x, y)
}

# 

#  The  function  gendata4  generates  4  multiple  point  clouds  that  may 

#  be  outlying  in  a  subset  of  the  k  variables.  All  outlying  variables 

#  must  be  at  the  same  level  like  in  "gendata". 

gendata4<-function(out1,out2,out3,out4,outshft1,outshft2,outshft3,
  outshft4,yshift1,yshift2,yshift3,yshift4,n,k,x)
{
{
  outs<-out1+out2+out3+out4   # the total planted outliers
  first<-n-outs+1
  last<-n-outs
  kmx<-k-x
  shift1<-7.5 + outshft1*4
  shift2<-7.5 + outshft2*4
  shift3<-7.5 + outshft3*4
  shift4<-7.5 + outshft4*4
  # one<-rep(1,n)
  jin<-matrix(rnorm(last*k,7.5,4.0),ncol=k)
  yin<-apply(5*jin,1,sum) + matrix(rnorm(last),ncol=1)
  yin<-matrix(yin,ncol=1)
  if (k == x)
  {
    j1<-matrix(shift1+runif(out1*k,0.0,0.25),ncol=k)
    j2<-matrix(shift2+runif(out2*k,0.0,0.25),ncol=k)
    j3<-matrix(shift3+runif(out3*k,0.0,0.25),ncol=k)
    j4<-matrix(shift4+runif(out4*k,0.0,0.25),ncol=k)
  }   # end if
  else
  {
    jout1<-matrix(shift1+runif(out1*x,0.0,0.25),ncol=x)
    jin1<-matrix(rnorm(out1*kmx,7.5,4.0),ncol=kmx)
    j1<-cbind(jout1,jin1)
    jout2<-matrix(shift2+runif(out2*x,0.0,0.25),ncol=x)
    jin2<-matrix(rnorm(out2*kmx,7.5,4),ncol=kmx)
    j2<-cbind(jout2,jin2)
    jout3<-matrix(shift3+runif(out3*x,0.0,0.25),ncol=x)
    jin3<-matrix(rnorm(out3*kmx,7.5,4.0),ncol=kmx)
    j3<-cbind(jout3,jin3)
    jout4<-matrix(shift4+runif(out4*x,0.0,0.25),ncol=x)
    jin4<-matrix(rnorm(out4*kmx,7.5,4),ncol=kmx)
    j4<-cbind(jout4,jin4)
  }   # endelse
  x<-rbind(jin,j1,j2,j3,j4)   # the x values
  # x<-cbind(one,x)
  x<-as.matrix(x)
  y1<-apply(5*j1,1,sum)+ yshift1
  y1<-matrix(y1,ncol=1)
  y2<-apply(5*j2,1,sum)+yshift2
  y2<-matrix(y2,ncol=1)
  y3<-apply(5*j3,1,sum)+ yshift3
  y3<-matrix(y3,ncol=1)
  y4<-apply(5*j4,1,sum)+yshift4
  y4<-matrix(y4,ncol=1)
  y<-rbind(yin,y1,y2,y3,y4)
  y<-matrix(y,ncol=1)
}
return(x, y)
}

# 

j<-gendata (3, 3, 4, 5, 5, 5,  60,  6,  3) 

j 
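
The generators above share one design: clean predictors are drawn from N(7.5, 4^2), the response is 5 times the row sum plus N(0,1) noise, and the planted cases are appended as the last rows. A minimal sketch of that design for a single cloud, written in R rather than S-PLUS, is given here. The name make_cloud and the list return are assumptions made for illustration (R does not accept the S-PLUS form return(x, y)); this is not a replacement for gendata.

```r
# Hypothetical helper (not part of the appendix): one planted cloud, R syntax.
make_cloud <- function(n, k, out1, outshft1, yshift1) {
  last <- n - out1                                    # last clean case
  xin  <- matrix(rnorm(last * k, 7.5, 4), ncol = k)   # clean predictors
  yin  <- rowSums(5 * xin) + rnorm(last)              # clean responses
  shift1 <- 7.5 + outshft1 * 4                        # cloud location in X-space
  xout <- matrix(shift1 + runif(out1 * k, 0, 0.25), ncol = k)
  yout <- rowSums(5 * xout) + yshift1                 # shifted, no noise (as in gendata)
  list(x = rbind(xin, xout), y = matrix(c(yin, yout), ncol = 1))
}

set.seed(1)
d <- make_cloud(n = 60, k = 6, out1 = 6, outshft1 = 4, yshift1 = 5)
dim(d$x)   # 60 rows, 6 columns
```

With outshft1 = 4 the cloud sits at 7.5 + 4*4 = 23.5, so every predictor of the last six rows lies between 23.5 and 23.75.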

SEBERT 

#  This  S-Plus  code  implements  the  Sebert  et  al.  (1998)  procedure 

#  to  identify  multiple  outliers  in  datasets.  This  code  is  significantly 

#  different  from  that  in  Sebert  (1996)  in  order  to  take  advantage  of 

#  some  recent  developments  in  the  language  and  use  standard  structures 

#  across  the  outlier  detection  procedures. 

# 

#  The  subroutine  resids.func  returns  the  scaled  predicted  and  residual  values 

#  from  OLS  regression. 

# 

resids.func<-function(x,y)
{
{
  e<-lsfit(x,y)$residuals
  yhat<-y-e
  data<-cbind(yhat,e)
  scaledata<-scale(data)
}
return(scaledata, e)
}

# 

#  The  subroutine  claster  does  a  single  linkage  cluster  analysis  on  the  scaled 

#  predicted  and  residual  values.  The  purpose  is  to  identify  the  clean  group 

#  of  observations  and  declare  all  others  as  candidate  outliers.  The  clusters 

# are separated by cutting the tree at Mojena's distance.

# 

claster.func<-function(data)
{
{
  h2<-hclust(dist(data,metric="euclidean"),method="connected")
  maxheight<-h2$height[length(h2$height)]
  meanheight<-mean(h2$height)
  stdheights<-sqrt(var(h2$height))
  mojenas<-meanheight+1.25*stdheights
  # In practice this never occurs, but just in case Mojenas height is
  # greater than the maxheight, cut the tree at maxheight.
  if (maxheight<=mojenas)
    clustergroups<-cutree(h2,h=maxheight-.01)
  else if(maxheight>mojenas)
    {clustergroups<-cutree(h2,h=mojenas)}
  # Of all the groups formed, the group number of the median observation should
  # be that of the clean subset.
  cleanid<-median(clustergroups)
  outlier<-ifelse(clustergroups==cleanid,0,1)
}
return(clustergroups, outlier)
}
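
The cut rule in claster.func can be tried on a toy data set. The sketch below is an R rendering under stated assumptions: R spells single linkage as method = "single" (the S-PLUS call above uses "connected"), and the function name mojena_groups and the toy points are hypothetical.

```r
# Hypothetical sketch in R of the single-linkage cut used by claster.func.
mojena_groups <- function(data, kfac = 1.25) {
  h <- hclust(dist(data), method = "single")
  cutht <- mean(h$height) + kfac * sd(h$height)          # mean + 1.25 sd of merge heights
  if (cutht >= max(h$height)) cutht <- max(h$height) - 0.01
  groups  <- cutree(h, h = cutht)
  cleanid <- median(groups)                              # label of the majority group
  list(groups = groups, outlier = ifelse(groups == cleanid, 0, 1))
}

set.seed(2)
pts <- rbind(matrix(rnorm(60, 0, 0.3), ncol = 2),   # 30 clean points near the origin
             matrix(rnorm(6, 8, 0.1), ncol = 2))    # 3 points far away
res <- mojena_groups(pts)
res$outlier   # the last three cases are flagged
```

Taking the median group label as the clean group only works when the clean cases are a clear majority, which is the setting the comments above assume.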


# The prog.sim subroutine simulates the procedure for N replications.
# This determines the percent of outliers detected and average false
# alarm rate. The set.seed(i) is required to have common random
# numbers between the different procedures so the exact same data sets
# are used to compare the methods.
#
prog.sim<-function(N,out1,out2,shiftx1,shiftx2,shifty1,shifty2,n,k,x)
{
{
  outs<-out1+out2   # total outliers
  first<-n-outs+1   # first planted outlying obs #
  last<-n-outs      # last clean obs #
  plant<-0
  false<-0
  i<-1
  while(i<=N){
    set.seed(i)
    cat("iteration ",i," ")
    # Choose any data generating function from makedata.ssc. Note changes may
    # be required in the prog.sim arguments depending on the selected data set.
    data<-gendata(out1,out2,shiftx1,shiftx2,shifty1,shifty2,n,k,x)
    # generate predicted and residual values.
    predres<-resids.func(data$x,data$y)
    detect.outs<-claster.func(predres$scaledata)
    # determine number of planted outliers detected in this run and add to
    # sum from all previous runs.
    plant<-plant + sum(detect.outs$outlier[first:n])
    # determine false alarms for this run and add to sum of previous runs
    false<-false + sum(detect.outs$outlier[1:last])
    i<-i+1
  }
  # from the experiment, the total probability a planted outlier is detected
  # (pp) and the probability a clean observation is classified an outlier (po).
  pp<-plant/(N*outs)
  po<-false/(N*last)
}
return(data, pp, po)
}

Aj<-prog.sim(5, 6, 6, 2, 2, 5, 5, 60, 6, 3)

Aj
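
The two summary rates returned by prog.sim can be illustrated on a small table of flags. This is a sketch in R with hypothetical names (rates, flags); it assumes, as the code above does, that the planted outliers are the last outs observations of each replication.

```r
# Hypothetical illustration in R of the two rates accumulated by prog.sim.
# flags: one row per replication, one column per observation (1 = declared outlier).
rates <- function(flags, outs) {
  n <- ncol(flags); N <- nrow(flags)
  last <- n - outs
  pp <- sum(flags[, (last + 1):n]) / (N * outs)   # detection rate on planted cases
  po <- sum(flags[, 1:last]) / (N * last)         # false alarm rate on clean cases
  c(pp = pp, po = po)
}

flags <- rbind(c(0, 0, 1, 0, 1, 1),
               c(0, 0, 0, 0, 0, 1))
rates(flags, outs = 2)   # pp = 0.75, po = 0.125
```

Here 3 of the 4 planted cases are flagged across the two replications (pp = 3/4) and 1 of the 8 clean cases is swamped (po = 1/8).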

SWALLOW  and  KIANIFARD 

#  This  S-PLUS  program  implements  the  Swallow  and  Kianifard  multiple  outlier 

#  detection  procedure  in  Biometrics,  52,  pp.  545-556.  It  uses  MAD  and 

#  interquartile  range  as  robust  estimates  of  scale.  Outward  stepping 

#  recursive  residuals  used  to  determine  outlier  status. 

#
# The function sk.madir computes the median absolute deviation (MAD) and
# interquartile range (IR) for a clean simulated set of data.
# It is called in sk.corfact to find correction factors.
# S-PLUS MAD uses the constant 1.4 for consistency at the Normal distribution.
#

sk.madir<-function(n,k)
{
{
  obs<-n*k
  x<-matrix(rnorm(obs,7.5,4),nrow=n,ncol=k)
  yhat<-NULL
  res<-NULL
  temp<-NULL
  y<-apply(5*x,1,sum) + matrix(rnorm(n),ncol=1)
  olsfit<-lsfit(x,y)
  res<-olsfit$resid
  medresq<-quantile(res,0.50)
  temp<-abs(res-medresq)
  madev<-quantile(temp,0.50)
  ir<-quantile(res,.75)-quantile(res,.25)
}
return(x, y, ir, madev)
}

#  Function  sk.corfact  determines  the  correction  factor  for  the  MAD  and  IR 

#  estimates  of  scale.  This  generates  Table  1  on  page  548.  The  MAD 

#  correction  factors  are  very  close  to  published  values,  the  IR  factors 

#  differ  (e.g.  for  n=25,  1.2541  vs  published  1.369  and  for 

#  n  =  50,  1.2899  vs  1.363).  For  n=60,  k=6  use  1.2436  for  IR  and  0.629 

#  for  MAD.  For  n  =  40  and  k  =  2  use  1.2711  for  IR  and  0.6452  for  MAD 

# 

sk.corfact<-function(N,n,k)
{
{
  iqv<-NULL
  madv<-NULL
  # N is the number of simulations (5000) and we create a vector of IR and
  # MAD scale estimates. The mean of these vectors is the correction factor.
  for (i in 1:N)
  {
    dat<-sk.madir(n,k)
    iqv[i]<-dat$ir
    madv[i]<-dat$madev
  }
  corfir<-mean(iqv)
  corfmad<-mean(madv)
}
return(corfir, corfmad)
}
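
The simulation behind the MAD correction factor can be reproduced in a few lines. The sketch below is an R version under assumptions: the name mad_factor, the seed and the reduced replication count of 300 are illustrative only, and the value printed will differ slightly from the 0.6452 quoted above for n = 40, k = 2 because the random streams differ.

```r
# Hypothetical sketch in R of the simulation behind the MAD correction factor.
mad_factor <- function(N, n, k) {
  madv <- numeric(N)
  for (i in 1:N) {
    x <- matrix(rnorm(n * k, 7.5, 4), nrow = n, ncol = k)
    y <- rowSums(5 * x) + rnorm(n)
    res <- lsfit(x, y)$residuals
    madv[i] <- median(abs(res - median(res)))   # raw MAD, no consistency constant
  }
  mean(madv)                                    # average raw MAD over N clean data sets
}

set.seed(5)
cf <- mad_factor(N = 300, n = 40, k = 2)
cf
```

Dividing a raw MAD by this average rescales it to estimate the unit error standard deviation at this n and k, which is how sk.recursive uses corfact.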

#  This  function  sk. initial  returns  the  initial  clean  set  of  ordered 

#  observations  by  externally  studentized  residual. 

# 

sk.initial<-function(x,y)
{
{
  x<-as.matrix(x)
  y<-as.matrix(y)
  id<-NULL
  vecone<-NULL
  n<-nrow(x)
  k<-ncol(x)
  z<-matrix(0,n,k+3)
  vecone<-rep(1,n)   # vector of 1s
  id<-1:n            # vector identifying observation num
  # We do not have first column of 1s in X
  olsfit<-lsfit(x,y,intercept=TRUE)
  infl<-ls.diag(olsfit)         # gives access to hat diagonals
  studres<-abs(infl$stud.res)   # note this is internally studentized
  temp<-cbind(x,y,studres,id)
  z<-temp[order(temp[,k+2]),]
  z<-as.matrix(cbind(1,z))
  # this is the matrix of x,y sorted by |studentized residual|. Note z has
  # an initial column of 1's.
}
return(z, temp)   # will not run if only return one value, temp not used
}


#  Function  sk. recursive  returns  the  recursive  residuals.  This  is  not 

#  the  most  efficient  code  since  an  updating  formula  as  in  Kianifard 

#  and  Swallow,  1990  could  be  used. 

# 

sk.recursive<-function(z)
{
{
  z<-as.matrix(z)
  k<-ncol(z)-3   # here k = p
  n<-nrow(z)
  nm1<-n-1
  kp1<-k+1
  kp2<-k+2
  corfact<-if(k==7) 0.629 else 0.645
  w<-NULL
  temp<-NULL
  recurres<-NULL
  tswir<-NULL
  tswmad<-NULL
  scaledres<-NULL
  # The i loop goes over the i observations and sequentially adds a clean obs
  for (i in kp1:nm1){
    cleanrows<-i
    # partition our ordered z matrix into clean subset
    cleanx<-z[1:cleanrows,1:k]
    cleany<-z[1:cleanrows,k+1]
    cleanx<-as.matrix(cleanx)
    cleany<-as.matrix(cleany)
    cp1<-cleanrows+1
    # do least squares fit on the clean subset
    olsfit<-lsfit(cleanx,cleany,intercept=FALSE)
    infl<-ls.diag(olsfit)
    # we will use the clean covariance matrix (unscaled by sigma) to determine
    # unscaled prediction error for the potential outliers
    varcov<-infl$cov.unscaled
    varcov<-as.matrix(varcov)
    # This computes equation 3.2 for the recursive residual
    fitted<-sum(olsfit$coef[1:k]*z[cp1,1:k])
    num<-(z[cp1,k+1]-fitted)
    denom<-sqrt(1+(z[cp1,1:k]%*%varcov%*%z[cp1,1:k]))
    w[cp1]<-num/denom
  }   # end i
  recurres<-w[kp2:n]
  # The following computes the test statistics for each observation by using
  # the absolute value of the recursive residual divided by the robust estimate
  # of scale (IR-sigmair or MAD-sigmamad)
  # sigmair<-(quantile(recurres,.75)-quantile(recurres,.25))/1.369
  medresq<-quantile(recurres,0.50)
  temp<-abs(recurres-medresq)
  madev<-quantile(temp,0.50)
  sigmamad<-madev/corfact
  # tswir<-abs(recurres/sigmair)
  tswmad<-abs(recurres/sigmamad)
  # This set of code tests the distribution of the OLS residuals from the
  # same set of generated data.
  usualols<-lsfit(z[,1:k],z[,k+1],intercept=FALSE)
  usualinfl<-ls.diag(usualols)
  scaledres<-abs(usualols$resid/usualinfl$std.dev)
}
return(tswir, tswmad, scaledres, corfact)
}
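
The recursive residual computed in the loop above is the one-step prediction error of the next ordered case, scaled by sqrt(1 + x'(X'X)^-1 x) from the current clean subset. A compact R sketch follows; the name recursive_resid and the test data are hypothetical. Unlike sk.recursive, which starts from a clean subset of p+1 cases, this sketch starts from p cases so that a standard identity can be checked: the squared recursive residuals then sum to the residual sum of squares of the full OLS fit.

```r
# Hypothetical sketch in R: recursive residuals for rows that are already ordered;
# xmat carries an explicit column of 1's, as z does in sk.recursive.
recursive_resid <- function(xmat, y) {
  n <- nrow(xmat); p <- ncol(xmat)
  w <- rep(NA_real_, n)
  for (i in p:(n - 1)) {
    xc <- xmat[1:i, , drop = FALSE]
    xtx_inv <- solve(crossprod(xc))                      # (X'X)^-1 on the first i rows
    beta <- xtx_inv %*% crossprod(xc, y[1:i])
    xnew <- xmat[i + 1, ]
    num <- y[i + 1] - sum(xnew * beta)                   # prediction error for case i+1
    den <- sqrt(1 + drop(t(xnew) %*% xtx_inv %*% xnew))
    w[i + 1] <- num / den
  }
  w
}

set.seed(3)
xmat <- cbind(1, matrix(rnorm(40, 7.5, 4), ncol = 2))    # n = 20, p = 3
y <- drop(xmat %*% c(0, 5, 5)) + rnorm(20)
w <- recursive_resid(xmat, y)
rss <- sum(lsfit(xmat, y, intercept = FALSE)$residuals^2)
c(sum(w^2, na.rm = TRUE), rss)   # the two values agree up to rounding
```

Refitting at every step, as here and in sk.recursive, is simple but costs a matrix inversion per case; the updating formula mentioned in the comments above avoids that.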

# 

#  The  function  sk.studtized  returns  the  test  statistics  for  the  studentized 

#  residuals  rather  than  the  recursive  residuals. 

# 

sk.studtized<-function(z)
{
{
  z<-as.matrix(z)
  n<-nrow(z)
  k<-ncol(z)-3   # here, k = p
  vecone<-NULL
  temp<-NULL
  tsstudir<-NULL
  tsstudmad<-NULL
  vecone<-rep(1,n)
  usualols<-lsfit(z[,1:k],z[,k+1],intercept=FALSE)
  usualinfl<-ls.diag(usualols)
  res<-usualols$resid
  sigmair<-(quantile(res,.75)-quantile(res,.25))/1.369
  medresq<-quantile(res,0.50)
  temp<-abs(res-medresq)
  madev<-quantile(temp,0.50)
  sigmamad<-madev/0.639
  # studentized resid = ei/sigmahat(1-hii)^.5
  tsstudir<-res/(sigmair*(sqrt(vecone-usualinfl$hat)))
  tsstudmad<-res/(sigmamad*(sqrt(vecone-usualinfl$hat)))
}
return(tsstudmad, tsstudir)
}

# 

#  The  function  sk.critval  finds  the  critical  values  for  the  test  statistics 

#  from  simulation.  The  procedure  is  1.  generate  clean  data  (e.g.  mv  normal) 

#  for  n=40,  60  etc  observations.  2.  find  the  recursive  residuals  and  the 

#  studentized  residuals.  3.  find  estimates  of  sigma  from  IR  and  MAD  if 

#  using  recursive  residuals  then  IR  and  MAD  are  on  recursive  residuals  but 

#  for  studentized  residuals,  then  use  IR  and  MAD  on  OLS  residuals.  4. 

#  Do  this  5000  times  so  have  5000  x  25  matrix  of  test  statistics. 

#  5.  Find  quantiles-note  for  recursive  residual  quantiles  use  1  sided 

#  but  use  2  sided  for  studentized  (e.g.  for  alpha=.05,  use  97.5  quantile 

#  for  Ri  and  95  quantile  for  wi) .  This  generates  table  2  page  550.  MAD 

#  has  consistent  results  with  table  2,  IR  with  Studentized  residuals 

#  deviates  most  from  table  2. 

#  For  n  =  40,  k  =  2,  95%  is  2.0835,  97.5%  is  2.4597  and  99%  is  2.8688 

#  For  n  =  60,  k  =  6,  95%  is  2.053,  97.5%  is  2.380  and  99%  is  2.797 

# 

sk.critval<-function(N,n,k)
{
{
  obs<-n*k
  numrr<-n-k-1   # number of recursive residuals = n-p-1
  tswir<-matrix(0,nrow=N,ncol=numrr)
  tswmad<-matrix(0,nrow=N,ncol=numrr)
  tssir<-matrix(0,nrow=N,ncol=n)
  tssmad<-matrix(0,nrow=N,ncol=n)
  usual<-matrix(0,nrow=N,ncol=n)
  # Note: initial.ar, recursive and studtized are called here as printed; the
  # functions defined above are named sk.initial, sk.recursive and sk.studtized.
  for (i in 1:N)
  {
    cat("iteration ",i," ")
    x<-matrix(rnorm(obs,7.5,4),nrow=n,ncol=k)
    y<-5*apply(x,1,sum)+rnorm(nrow(x),0,1)
    oaks<-initial.ar(x,y)
    oaksw<-recursive(oaks$z)
    tswir[i,]<-oaksw$tswir
    tswmad[i,]<-oaksw$tswmad
    oakss<-studtized(oaks$z)
    tssir[i,]<-oakss$tsstudir
    tssmad[i,]<-oakss$tsstudmad
    usual[i,]<-oaksw$scaledres
  }   # end i
}
return(tswir, tswmad, tssir, tssmad, usual)
}


# To get the critical value, take the appropriate quantile of the large
# matrix of residuals returned.
# j<-sk.critval(5000, 60, 6)
# j
# critval.mad<-quantile(j$tswmad, 0.975)

# Function sk.ps is the program simulation to determine the detection
# and false alarm probabilities. N is the number of replications; the rest
# of the parameters are to generate the data (no col of 1's needed).
# Of the four possibilities, we consider only the recursive residuals
# (not studentized) and using the MAD estimate (not IR) of scale.
sk.ps<-function(N,out1,out2,xshift1,xshift2,yshift1,yshift2,n,k,x)
{
{
  teststats<-NULL
  pp1<-NULL
  id<-NULL
  plantdet<-NULL
  pplant<-0.0
  pfalse<-0.0
  outs<-out1+out2   # total outliers
  first<-n-outs+1   # the id of the first planted outlier
  last<-n-outs      # the id of the last clean observation
  kp3<-k+3          # determine how large to make initial subset
  critval<-if(k==6) 2.380 else 2.460
  for (i in 1:N)
  {
    cat("iteration ",i," ")
    set.seed(i)
    data<-gendata(out1,out2,xshift1,xshift2,yshift1,yshift2,n,k,x)
    sortdata<-sk.initial(data$x,data$y)               # ordered by studentized resid
    teststats<-sk.recursive(sortdata$z)               # finds recursive residuals
    pp1<-ifelse(teststats$tswmad > critval,1.0,0.0)   # exceeds crit val
    idcol<-ncol(sortdata$z)                           # observation number location
    id<-sortdata$z[kp3:n,idcol]                       # respective observation vector
    # if detect planted outlier then =1 else =0
    plantdet<-ifelse(pp1==0.0,0,ifelse(id>last,1,0))
    # here we've exceeded the critical value but it is not a planted outlier
    false<-ifelse(pp1==0.0,0,ifelse(id>last,0,1))
    pplant<-pplant+sum(plantdet)   # counter for planted outliers
    pfalse<-pfalse+sum(false)      # counter for false alarms
    pp<-pplant/(N*outs)            # probability of detecting planted outlier
    po<-pfalse/(N*(n-outs))        # probability of false alarm
  }
}
return(data, pp, po, critval)

j<-sk.ps (500, 6, 6, 5, 5, 10, 10, 60, 6, 3) 

j 

PENA AND YOHAI

#  This  program  implements  the  procedure  from  Pena  and  Yohai,  JRSS  (B) 

#  ,  1995  to  detect  influential  subsets  in  regression.  The  crux  of 

#  the  procedure  evaluates  the  eigenstructure  of  the  influence  matrix. 

#  The  function  inflmatrix  creates  the  influence  matrix  M  and  outputs  the 

#  eigenvectors  of  this  matrix.  The  computational  version  given  in  the 

# equation in section 4 is used as we assume n >> p

# 

inflmatrix<-function(x,y)
{
{
  x<-as.matrix(x)
  y<-as.matrix(y)
  n<-nrow(x)
  p<-ncol(x)
  res<-matrix(0,nrow=n,ncol=1)
  hat<-NULL
  vecone<-rep(1,n)
  olsfit<-lsfit(x,y,intercept=FALSE)
  infl<-ls.diag(olsfit)
  res<-olsfit$resid
  E<-diag(res,nrow=n,ncol=n)   # make diagonal matrix of residuals
  E<-as.matrix(E)
  hat<-1/(vecone-infl$hat)     # make diagonal matrix of 1/(1-hii)
  D<-diag(hat,nrow=n,ncol=n)
  D<-as.matrix(D)
  temp<-eigen(infl$cov.unscaled)   # eigenvectors of (x'x)^-1
  # note that the eigenvectors differ often significantly
  # if we compute (x'x)^-1 directly (not from ls.diag) unless
  # we specify "digits" to be sufficiently large.
  B<-temp$vectors
  B<-as.matrix(B)
  L<-diag(sqrt(temp$values),nrow=p,ncol=p)
  L<-as.matrix(L)
  A<-B %*% L
  A<-as.matrix(A)
  EDXA<-E %*% D %*% x %*% A
  scaleit<-1.00000000/(sqrt(p)*infl$std.dev)
  P<-scaleit*EDXA
  PtPeig<-eigen(t(P)%*%P)
  Meigvect<-P %*% PtPeig$vectors
}
return(Meigvect, p)
}
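
The reason inflmatrix only needs a p x p eigenproblem can be checked numerically. With P built as in the function, the n x n matrix is M = P P', whose (i,j) entry is e_i e_j h_ij / (p s^2 (1-h_ii)(1-h_jj)); if v is an eigenvector of P'P with eigenvalue lambda, then P v is an eigenvector of M with the same eigenvalue. The R sketch below uses made-up data and illustrative variable names.

```r
# Hypothetical check in R that P %*% v is an eigenvector of M = P P'.
set.seed(4)
n <- 30; p <- 3
x <- cbind(1, matrix(rnorm(n * (p - 1), 7.5, 4), ncol = p - 1))
y <- drop(x %*% c(0, 5, 5)) + rnorm(n)
fit <- lsfit(x, y, intercept = FALSE)
dg  <- ls.diag(fit)
e <- fit$residuals; h <- dg$hat; s <- dg$std.dev
eg <- eigen(dg$cov.unscaled, symmetric = TRUE)
A  <- eg$vectors %*% diag(sqrt(eg$values), p)     # A A' = (X'X)^-1
P  <- (e / (1 - h)) * (x %*% A) / (sqrt(p) * s)   # row i scaled by e_i / (1 - h_ii)
M  <- P %*% t(P)                                  # n x n influence matrix
ev <- eigen(crossprod(P), symmetric = TRUE)       # only a p x p problem
v1 <- P %*% ev$vectors[, 1]
max(abs(M %*% v1 - ev$values[1] * v1))            # close to zero
```

M has rank at most p, so the p vectors P v carry all of its eigenvectors with nonzero eigenvalue; the n x n matrix never has to be decomposed.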

#

#  The  function  AUTOID  attaches  the  observation  number  associated  with  the 

#  eigenvector.  It  also  sorts  the  eigenvector  and  finds  conditions  for 

#  outliers.  Input  is  the  eigenvectors  from  the  influence  matrix  and 

#  the  critical  distance  k  to  declare  the  outlying  set. 

# 

autoid<-function(M,k)
{
{
  M<-as.matrix(M)
  n<-nrow(M)
  p<-ncol(M)
  # c1 and c2 are the constants used for breakdown adjustment
  c1<-floor(n/4)
  c2<-floor(n/4)
  # there really is no reason to have both c1 and c2 since we
  # have no way of knowing if the eigenvector will have negative
  # or positive values for the outliers.
  id<-seq(n)
  ev<-array(0,dim=c(p,n,2))       # initialize the sorted eigenvector
  outa<-array(0,dim=c(p,c1,2))    # initialize the array for a values of 4.1.b
  # "outa" values go with the positive scores
  outb<-array(0,dim=c(p,c2,2))    # initialize the array for b values in 4.1.b
  a<-matrix(0,nrow=c1,ncol=p)     # initialize the values for a
  ida<-matrix(0,nrow=c1,ncol=p)   # matrix has the observation id for a values
  idb<-matrix(0,nrow=c2,ncol=p)   # idb and b are outb
  b<-matrix(0,nrow=c2,ncol=p)
  temp<-matrix(0,ncol=2,nrow=n)
  seta<-matrix(0,ncol=c1,nrow=p)
  # this is the set of outlying observations from the "a" vector.
  setb<-matrix(0,ncol=c1,nrow=p)
  for (i in 1:p)
  {
    temp<-cbind(M[,i],id)
    ev[i,,]<-temp[order(temp[,1]),]   # now eigenvectors are ordered
    for (j in 1:c2)
    {
      # we need to protect against the situation when the value is very close to 0
      # such as .00027 when we divide so we don't get false alarms for the wrong
      # reason. ev[3,1,2] means third eigenvector, first row, obs id; for the 3rd
      # dimension if use 1, that is the score. We assign low scores the median
      # value taking into account if it is the positive or negative score.
      medM<-median(abs(M))
      if(abs(ev[i,(j+1),1])< medM) ev[i,(j+1),1]<- -medM
      b[j,i]<-ev[i,j,1]/ev[i,(j+1),1]
      idb[j,i]<-ev[i,j,2]
      idx<-j+(3*c2)
      if(abs(ev[i,(idx-1),1])< medM) ev[i,(idx-1),1]<-medM
      a[j,i]<-ev[i,idx,1]/ev[i,(idx-1),1]
      ida[j,i]<-ev[i,idx,2]
    }   # end j
    outa[i,,]<-cbind(a[,i],ida[,i])
    outb[i,,]<-cbind(b[,i],idb[,i])
  }   # end i
  # Now we form the set of observations which are outliers. There are p
  # eigenvectors but we only use c1 of the scores. The constant k is key here.
  # It measures how large of a difference between two scores has to be before
  # declaring the set outlying. Simulations show a value of 2.5 is perhaps too
  # small based on the number of false alarms. The authors suggest step 2
  # (t tests) will correct the false alarm problems.
  for (i in 1:p)
  {
    for (j in 1:c1)
    {
      if (outa[i,j,1]>k)   # if ratio of scores for positive scores > k
      {
        idx<-j
        while (idx<=c1)
        # take all observations from the breakpoint up to n as outliers.
        {
          seta[i,idx]<-outa[i,idx,2]
          idx<-idx+1
        }   # endwhile
      }   # end if
      if (outb[i,j,1]>k)   # ratio of scores for negative scores
      {
        idx<-j
        while(idx>0)
        # take all observations from the breakpoint back to 1 (the most neg)
        {
          setb[i,idx]<-outb[i,idx,2]
          idx<-idx-1
        }   # endwhile
      }   # endif
    }   # end j
  }   # end i
}
return(outa, seta, outb, setb)
}

#
# This function simulates the procedure N times for the n observations each
# run.  An N x n matrix called obs is used to compute the proportion of
# correctly identified observations since it is known the outliers were
# planted as the last few cases.
# This is for genrand, gendata, gendatamed2, gendatamed1
prog.sim<-function(N,out1,out2,xshift1,xshift2,yshift1,yshift2,n,k,x)
# This is for gendata2
# prog.sim<-function(N,out1,out2,xs11,xs12,xs21,xs22,yshift1,yshift2,n,k,x)
# This is for gendata6
# prog.sim<-function(N,out1,out2,xs11,xs12,xs13,xs14,xs15,xs16,xs21,xs22,xs23,
#   xs24,xs25,xs26,yshift1,yshift2,n,k,x)
{
  {
    out<-out1+out2
    firstout<-n-out+1
    lastclean<-n-out
    p<-k+1
    c1<-floor(n/4)
    obs<-matrix(0,nrow=N,ncol=n)
    for (i in 1:N)
    {
      cat("iteration ",i,"\n"," ")
      set.seed(i)
      # This generates data from gendatamed1,2, genrand or gendata
      a<-gendata(out1,out2,xshift1,xshift2,yshift1,yshift2,n,k,x)
      # This generates data from gendatbig2
      # a<-gendatbig2(out1,out2,xs11,xs12,xs21,xs22,yshift1,yshift2,n,k,x)
      # This generates data from gendatbig6
      # a<-gendatbig6(out1,out2,xs11,xs12,xs13,xs14,xs15,xs16,xs21,xs22,xs23,
      #   xs24,xs25,xs26,yshift1,yshift2,n,k,x)
      a$x<-cbind(1,a$x)
      b<-inflmatrix(a$x,a$y)
      c<-autoid(b$Meigvect,2.5)
      # The following code looks at the observations declared outliers from both
      # set A and set B.  If the observation appears in either set, the obs matrix
      # is assigned a value of 1.  This avoids double counting an observation
      # that may appear as an outlier from two separate eigenvectors.  This obs
      # matrix can then be used to compute any statistic of interest from the
      # simulation.
      for (j in 1:p)
      {
        for (l in 1:c1)
        {
          if (c$seta[j,l]>0)
          {
            temp<-c$outa[j,l,2]
            obs[i,temp]<-1
          } # end if
          if (c$setb[j,l]>0)
          {
            temp<-c$outb[j,l,2]
            obs[i,temp]<-1
          } # end if
        } # end l
      } # end j
    } # end i
    avg<-apply(obs,2,mean)
    # pp is the percentage of outliers correctly identified while po is the
    # probability of swamping clean observations.
    #
    pp<-mean(avg[firstout:n])
    po<-mean(avg[1:lastclean])
  }
  return(a,pp,po)
}
j<-prog.sim(5,4,4,5,5,5,5,40,2,2)
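# A minimal sketch in R (not part of the original S-Plus listing) of how the
# N x n indicator matrix obs reduces to the two summary rates pp and po.  The
# helper name detect.rates and the small example matrix are hypothetical.

```r
# Hypothetical helper: summarise an N x n 0/1 matrix of flagged observations
# when the last `out` cases of every run are the planted outliers.
detect.rates <- function(obs, out) {
  n <- ncol(obs)
  avg <- apply(obs, 2, mean)               # per-observation flag frequency
  list(pp = mean(avg[(n - out + 1):n]),    # planted outliers detected
       po = mean(avg[1:(n - out)]))        # clean observations swamped
}

# Two runs, five observations, the last two planted as outliers
obs <- matrix(c(0, 0, 1, 1, 1,
                0, 0, 0, 0, 1), nrow = 2, byrow = TRUE)
rates <- detect.rates(obs, out = 2)
```

# Averaging the column means first, as the listing does, weights every
# observation equally across runs.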


ROUSSEEUW and VAN ZOMEREN

# This code incorporates the Rousseeuw and van Zomeren (1990) procedure with
# both rule of thumb/chi square and simulated critical cutoff values.
# The subroutine critvals computes the simulated critical cutoff values
# for the scaled residuals from the LMS fit.  We generate
# lots of N * n clean residuals and find the appropriate percentiles.
# For n = 40, k = 2 use 3.61 for 98.75% or 3.01 for 97.5%
# for LMS.  Use the same for n = 60, k = 6 for LMS.  For n = 40, k = 2 use
# 3.38 for 98.75% and 2.87 for 97.5% and for n = 60, k = 6 use 4.11
# for 98.75% and 3.51 for 97.5% for LTS.
#
critvals<-function(N,n,k)
{
  {
    resmat<-matrix(0,nrow=N,ncol=n)
    for (i in 1:N)
    {
      set.seed(i)
      cat("iteration ",i," ")
      datapts<-n*k
      # generate clean x matrix for a run multivariate normal (7.5, 4^2)
      x<-matrix(rnorm(datapts,7.5,4),ncol=k)
      y<-apply(5*x,1,sum)
      y<-matrix(y,ncol=1)+matrix(rnorm(n),ncol=1)
      # a<-lmsreg(x,y)
      a<-ltsreg(x,y)
      stdresi<-a$residuals/a$scale
      resmat[i,]<-stdresi
    }
    q95<-quantile(resmat,0.95)
    q975<-quantile(resmat,0.975)
    q9875<-quantile(resmat,0.9875)
    q05<-quantile(resmat,0.05)
    q025<-quantile(resmat,0.025)
    q0125<-quantile(resmat,0.0125)
  }
  return(q95,q975,q9875,q05,q025,q0125)
}
# b4<-critvals(50,60,6)
# b4
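# A rough R sketch of the same simulate-then-take-quantiles idea.  As an
# assumption made only to keep the example self-contained, OLS via lsfit()
# stands in for ltsreg()/lmsreg(), and residuals are scaled by the root mean
# square error; sim.cutoff is a hypothetical name.

```r
# Simulate N clean data sets and return upper quantiles of the pooled
# standardized residuals (OLS is a stand-in for the robust fits).
sim.cutoff <- function(N, n, k, probs = c(0.975, 0.9875)) {
  resmat <- matrix(0, nrow = N, ncol = n)
  for (i in 1:N) {
    set.seed(i)
    x <- matrix(rnorm(n * k, 7.5, 4), ncol = k)   # clean regressors
    y <- apply(5 * x, 1, sum) + rnorm(n)          # clean responses
    res <- lsfit(x, y)$residuals
    resmat[i, ] <- res / sqrt(sum(res^2) / (n - k - 1))
  }
  quantile(resmat, probs)
}

cutoffs <- sim.cutoff(N = 50, n = 40, k = 2)
```

# The cutoffs quoted in the comments for LMS and LTS are larger than what
# this OLS stand-in produces, because those fits scale residuals differently.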

# 
# The subroutine robdist computes the robust distances with MVE estimator
#
robdist<-function(x)
{
  {
    x<-as.matrix(x)
    n<-nrow(x)
    p<-ncol(x)
    transpx<-t(x)
    varcov<-var(x)
    mn<-apply(x,2,mean)
    md<-mahalanobis(x,mn,varcov)
    # This section computes the minimum volume ellipsoid robust distances
    v<-cov.mve(x)
    dmve<-mahalanobis(x,v$center,v$cov)
  }
  return(dmve,md)
}
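# A small R sketch of flagging leverage points from squared Mahalanobis
# distances.  It uses the classical mean and covariance; estimates from
# cov.mve() would replace them to obtain robust distances.  The data are
# made up, and qchisq(0.975, 2) is about 7.38, close to the 7.3984 used as
# the rule of thumb cutoff in this listing.

```r
set.seed(1)
x <- matrix(rnorm(80), ncol = 2)          # 40 clean bivariate points
x[40, ] <- c(8, 8)                        # one planted leverage point
# mahalanobis() returns squared distances
d2 <- mahalanobis(x, apply(x, 2, mean), var(x))
cutoff <- qchisq(0.975, df = ncol(x))     # chi square rule of thumb
flagged <- which(d2 > cutoff)
```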


# The subroutine reglms computes the least median of squares regression
# residuals and returns two n vectors: 1) chkrot is 1 if the observation
# residual value exceeds the rule of thumb cutoff else it is 0,
# 2) chksim is 1 if the residual exceeds the simulated cutoff value, 0 o.w.
#
reglms<-function(x,y)
{
  {
    j<-lmsreg(x,y)
    # Need scaled residuals
    stdresi<-abs(j$residuals/j$scale)
    chkrot<-ifelse(stdresi>2.5,1,0)
    chksim<-ifelse(stdresi>3.61,1,0)
  }
  return(chkrot,chksim,stdresi)
}

# The subroutine prog.sim determines the probability the planted outliers
# are detected and the false alarm probability for various outlier scenarios
#
prog.sim<-function(N,out1,out2,outshft1,outshft2,yshift1,yshift2,n,k,x,iter)
{
  {
    outs<-out1+out2  # the total planted outliers
    first<-n-outs+1  # first observation that is a planted outlier
    last<-n-outs  # last clean observation
    # initialize values
    summves<-0  # total detected with simulated R&vZ
    summvefs<-0  # total R&vZ false alarms
    summver<-0  # total detected with original R&vZ
    summvefr<-0  # total false alarms for original R&vZ
    # critical values for the MVE procedure differ with parameters
    chicrit<-if(k==2) 7.3984 else 14.45
    mvesimcrit<-if(k==2) 9.3225 else 17.935
    # generate data sets
    for (i in 1:N){
      cat("you're on iteration ",i," ")
      set.seed(i)
      data<-gendata(out1,out2,outshft1,outshft2,yshift1,yshift2,n,k,x)
      rdist<-robdist(data$x,iter)
      #
      # The MVE procedure.  Note the simulated critical value is
      # 17.935 for the 97.5% if n = 60 and p = 6 variables.  For n = 40 and
      # p = 2, then we use the simulated value as 9.9225.  Rousseeuw and
      # van Zomeren recommend using Chi Square with p degrees of freedom.
      # For our case that would be p = 2 degrees of freedom so the
      # For LMS, the recommendation is 2.5 and the simulated value is 3.61
      # (98.75th) to control total experimentwise error to 5%
      #
      resout<-reglms(data$x,data$y)
      mvems<-ifelse(rdist$dmve>mvesimcrit,1,0)
      resdiss<-ifelse(mvems+resout$chksim>0,1,0)
      mveouts<-sum(resdiss[first:n])
      summves<-summves + mveouts
      mvefalses<-sum(resdiss[1:last])
      summvefs<-summvefs + mvefalses
      # Rule of thumb critical values from Chi Square are 7.3984 for p = 2
      # and 14.45 for p = 6.  These are the critical values for robust
      # distances based on alpha = 0.025.
      mvemr<-ifelse(rdist$dmve>chicrit,1,0)
      resdisr<-ifelse(mvemr+resout$chkrot>0,1,0)
      mveoutr<-sum(resdisr[first:n])
      summver<-summver + mveoutr
      mvefalser<-sum(resdisr[1:last])
      summvefr<-summvefr + mvefalser
    } #end for
    # Statistics for all the runs, pp is proportion of planted outliers detected
    # po is the probability clean observations are classified as outliers
    ppmves<-summves/(N*outs)
    pomves<-summvefs/(N*(n-outs))
    ppmver<-summver/(N*outs)
    pomver<-summvefr/(N*(n-outs))
  }
  return(data,ppmves,pomves,ppmver,pomver)
}
j<-prog.sim(5,6,6,5,5,9,9,60,6,3,1000)
j

REGRESSION ESTIMATORS

# This program finds outliers using the residuals from regression
# estimators.  The first step is finding the critical cutoff
# value to determine if the observation is an outlier.  Next,
# the observations are classified for the run and tallied over
# the number of replications.
# make sure you load robeth library >library(robeth)
# The subroutine critval calculates the quantiles of clean data for
# various n and k.  These are the avg of 2.5th and 97.5th quantiles
# For M, use 1.85 for both n = 60, k = 6 and n = 40, k = 2
# For LTS use 3.56 for n = 60 and k = 6 and 2.87 for n = 40, k = 2
# For LMS use 3.01 for both sample sizes.
# For MM use 1.90 for both
# For Simpson, use 1.981 for both
# For BM and OLS (Walker GM), use 1.960 for both
# For CH (Coakley Hettmansperger), use 2.084 for both
#
critval<-function(N,n,k)
{
  {
    values<-matrix(0,nrow=N,ncol=n)
    obs<-n*k
    for (i in 1:N)
    {
      cat("iteration ",i," ")
      set.seed(i)
      x<-matrix(rnorm(obs,7.5,4),ncol=k)
      y<-apply(5*x,1,sum) + matrix(rnorm(n),ncol=1)
      # put in whatever regression estimator you want the quantiles for
      # in the next line, chreg, ltsreg, lmsreg, lsfit, myhbhe (MM), rreg (M)
      a<-bmreg(x,y)
      values[i,]<-a$residuals
    } # end for
    q95<-quantile(values,0.95)
    q975<-quantile(values,0.975)
    # use 98.75 in Rousseeuw and van Zomeren type applications
    # to keep experimentwise error to a total 5%
    q9875<-quantile(values,0.9875)
    q99<-quantile(values,0.99)
    q05<-quantile(values,0.05)
    q01<-quantile(values,0.01)
    q0125<-quantile(values,0.0125)
    q025<-quantile(values,0.025)
  }
  return(q95,q975,q9875,q99,q05,q025,q0125,q01)
}
#b<-critval(1000,60,6)
#b

#
# The function prog.sim generates the datasets and determines the
# probability the residuals detect the outliers and the false alarm
# probability.
prog.sim<-function(N,out1,out2,xs1,xs2,yshift1,yshift2,n,k,x)
{
  {
    outs<-out1+out2
    first<-n-outs+1
    last<-n-outs
    sumfalseols<-0
    sumdetectols<-0
    sumfalsebm<-0
    sumdetectbm<-0
    sumfalsech<-0
    sumdetectch<-0
    sumfalsejs<-0
    sumdetectjs<-0
    sumdetectlms<-0
    sumfalselms<-0
    sumdetectlts<-0
    sumfalselts<-0
    sumfalsemm<-0
    sumdetectmm<-0
    sumdetectm<-0
    sumfalsem<-0
    # only the critical value of LTS estimator depends on dimension
    cvlts<-if(n==60) 3.56 else 2.87
    for(i in 1:N)
    {
      set.seed(i)
      cat("iteration ",i," ")
      data<-gendata(out1,out2,xs1,xs2,yshift1,yshift2,n,k,x)
      a<-bmreg(data$x,data$y)
      b<-chreg(data$x,data$y)
      c<-lmsreg(data$x,data$y)
      d<-myhbhe(data$x,data$y)
      e<-lsfit(data$x,data$y)
      f<-bijs5sa(data$x,data$y)
      g<-rreg(data$x,data$y)
      h<-ltsreg(data$x,data$y)
      # Bounded influence estimator (Walker)
      outliersbm<-ifelse(abs(a$residuals)>1.96,1,0)
      falsebm<-sum(outliersbm[1:last])
      sumfalsebm<-sumfalsebm+falsebm
      detectbm<-sum(outliersbm[first:n])
      sumdetectbm<-sumdetectbm+detectbm
      # CH - Coakley Hettmansperger compound estimator
      outliersch<-ifelse(abs(b$residuals)>2.084,1,0)
      falsech<-sum(outliersch[1:last])
      sumfalsech<-sumfalsech+falsech
      detectch<-sum(outliersch[first:n])
      sumdetectch<-sumdetectch+detectch
      # OLS
      outliersols<-ifelse(abs(e$residuals)>1.96,1,0)
      falseols<-sum(outliersols[1:last])
      sumfalseols<-sumfalseols+falseols
      detectols<-sum(outliersols[first:n])
      sumdetectols<-sumdetectols+detectols
      # LTS
      outlierslts<-ifelse(abs(h$residuals)>cvlts,1,0)
      falselts<-sum(outlierslts[1:last])
      sumfalselts<-sumfalselts+falselts
      detectlts<-sum(outlierslts[first:n])
      sumdetectlts<-sumdetectlts+detectlts
      #
      # LMS
      outlierslms<-ifelse(abs(c$residuals)>3.01,1,0)
      falselms<-sum(outlierslms[1:last])
      sumfalselms<-sumfalselms+falselms
      detectlms<-sum(outlierslms[first:n])
      sumdetectlms<-sumdetectlms+detectlms
      # Simpson and Montgomery estimator
      outliersjs<-ifelse(abs(f$residuals)>1.981,1,0)
      falsejs<-sum(outliersjs[1:last])
      sumfalsejs<-sumfalsejs+falsejs
      detectjs<-sum(outliersjs[first:n])
      sumdetectjs<-sumdetectjs+detectjs
      # M
      outliersm<-ifelse(abs(g$residuals)>1.85,1,0)
      falsem<-sum(outliersm[1:last])
      sumfalsem<-sumfalsem+falsem
      detectm<-sum(outliersm[first:n])
      sumdetectm<-sumdetectm+detectm
      # MM
      outliersmm<-ifelse(abs(d$rs1)>1.90,1,0)
      falsemm<-sum(outliersmm[1:last])
      sumfalsemm<-sumfalsemm+falsemm
      detectmm<-sum(outliersmm[first:n])
      sumdetectmm<-sumdetectmm+detectmm
    } #end for
  }
  ppbm<-sumdetectbm/(N*outs)
  pobm<-sumfalsebm/(N*last)
  ppch<-sumdetectch/(N*outs)
  poch<-sumfalsech/(N*last)
  ppols<-sumdetectols/(N*outs)
  pools<-sumfalseols/(N*last)
  pplts<-sumdetectlts/(N*outs)
  polts<-sumfalselts/(N*last)
  pplms<-sumdetectlms/(N*outs)
  polms<-sumfalselms/(N*last)
  ppjs<-sumdetectjs/(N*outs)
  pojs<-sumfalsejs/(N*last)
  ppm<-sumdetectm/(N*outs)
  pom<-sumfalsem/(N*last)
  ppmm<-sumdetectmm/(N*outs)
  pomm<-sumfalsemm/(N*last)
  return(data,ppbm,pobm,ppch,poch,ppols,pools,pplts,polts,pplms,polms,ppjs,
    pojs,ppm,pom,ppmm,pomm)
}
j<-prog.sim(500,3,3,5,5,5,5,40,2,2)
j


Appendix  B 


S-Plus  Code  and  Data  for  Chapter  4  Studies 
# LEVERAGE STUDY
# The leverage study evaluates several robust distance measures.  The
# MVE and MCD estimates of mean and covariance matrix are available internal
# to S-Plus with the command cov.mve, cov.mcd.  Also built in is the
# function mahalanobis which can calculate the robust distances if the mean
# and covariance matrix are supplied.  Distances for M-estimates of
# covariance are available from the ROBETH library and the Simpson and
# Montgomery compound estimator code in Ch 5 shows how to do that.  The
# code for Hadi (1992, 1994) is not shown, but is available from his web
# site.  The code to implement the C++ version of R&W is shown below.

# ROCKE AND WOODRUFF PROCEDURE
# The robust distance procedure from Rocke and Woodruff (1996) is written in C++.
# A callable S+ routine can be formed by creating a dynamic data
# link in a C++ compiler.  This is not a trivial process.  The dll
# is called "multoutlier.dll" and is accessed via
# >dll.load("c:\\mydir\\multoutlier.dll", "MultOut", "cdecl")
#
# The function multo actually calls the C++ code.  The input
# values are the data set x and the number of variables p.  The next
# 5 values in the function are output from the dll.  mn and cov are the
# robust mean and covariance matrix estimates, dist is the n vector of
# robust distances, rej is the simulated critical cutoff value and
# status is an internal report for algorithm function.  Note initial
# values must be input during a function call.
#
multo<-function(x, p, mn, cov, dist, rej, status)
{
  .C("MultOut",
    as.double(x),
    as.double(p),
    as.double(mn),
    as.double(cov),
    as.double(dist),
    as.double(rej),
    as.integer(status))
}

# The function rocky gets the R&W robust distances and cutoff values.
# The robust distances from MVE, MCD and the Mahalanobis distance can
# also be calculated (Ch 4 leverage study) if desired.
rocky<-function(x, iter)
{
  {
    x <- as.matrix(x)
    n <- nrow(x)
    p <- ncol(x)
    # This section computes the Mahalanobis distance.  varcov and mn are
    # also passed to multo below as the required initial values.
    transpx <- t(x)
    varcov <- var(x)
    mn <- apply(x, 2, mean)
    # md <- mahalanobis(x, mn, varcov)
    # This section computes the minimum volume ellipsoid robust distances
    # v<-cov.mve(x)
    # dmve<-mahalanobis(x,v$center,v$cov)
    # This section computes the minimum covariance determinant robust distances
    # cd<-cov.mcd(x)
    # dmcd<-mahalanobis(x,cd$center,cd$cov)
    # Initialize the variables for the Rocke and Woodruff algorithm
    status <- 0  # returns an error code if encountered like singular data
    rejdis <- 3  # if the robust distance is beyond this, then an outlier
    dist <- rep(0, n)  # initial robust distance vector
    # These are the inputs to the Rocke and Woodruff procedure, p is number of
    # variables, n is the number of observations, iter specifies how many
    # iterations are used for the smooth estimators (Tukey Biweight M-estimate)
    # The authors recommend n^2 iterations and this is a sensitive parameter.
    # the next two 0's use the default values for the seed and lambda multiplier
    # 0.05 is used as the alpha value for the cutoff value
    # the last two 0's are used for the simulation tolerance and trace options
    parms <- c(p, n, iter, 0, 0, 0.05, 0, 0)
    j <- multo(transpx, parms, mn, varcov, dist, rejdis, status)
    distance <- as.numeric(j[[5]])  # the 5th output is the robust distance
    reject <- as.numeric(j[[6]])  # the sixth is the cutoff value
  }
  # add md, dmve, and dmcd to the return list if desired
  return(distance, reject)
}

# INITIAL ESTIMATOR STUDY
# This is the proposed initial estimator in Chapter 4, P1, that uses a
# high breakdown R&W filter to clear high leverage observations
# followed by a high breakdown MM estimate to remove the outliers on
# interior X-space.  The coefficients are estimated with OLS on the
# observations that remain.
#
P1<-function(x,y,iterat=1500)
{
  {
    x<-as.matrix(x)
    p<-ncol(x)
    # upon entry find the unusual observations in X-space with
    # Rocke and Woodruff procedure
    rwdist<-rocky(x,iterat)
    good<-ifelse(rwdist$distance<rwdist$reject,1,0)
    goodx<-x[good==1,]
    goodxint<-cbind(1,x[good==1,])
    goody<-y[good==1]
    # MM estimator from ROBETH library on low leverage observations
    mm<-myhbhe(goodxint,goody)
    # MM estimator internal to SPLUS (many problems with datasets
    # in Chapter 4)
    # mm<-lmRobMM(goody~goodx,efficiency=.90)
    # The simulated critical value for 2 tailed 95% is 1.90 for both
    # n = 60, k = 6 and n = 40, k = 2.
    cleanx<-goodx[abs(mm$rs1)<1.90,]
    cleany<-goody[abs(mm$rs1)<1.90]
    initest<-lsfit(cleanx,cleany)
    # proportion of observations used in the OLS fit.
    percentobs<-length(initest$residuals)/length(y)
    # proportion of observations retained after the high leverage filter.
    pctobsx<-length(goody)/length(y)
    # residuals for all n observations need the full design matrix
    xint<-cbind(1,x)
    predval<-initest$coef%*%t(xint)
    resids<-y-t(predval)
    medres<-median(abs(resids))
    # median of absolute residuals
    scale <- 1.4826 * (1 + (5/(length(y) - p - 1))) * medres
    # lms scale estimate
    list(robdist=rwdist$distance,reject=rwdist$reject,coef=initest$coef,
      pctobsx=pctobsx,percentobs=percentobs,scale=scale,residuals=resids)
  }
}
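# A short R sketch of the LMS-type scale estimate used in P1, written as a
# stand-alone function.  The name lms.scale and the example residuals are
# hypothetical; p is the number of regressors excluding the intercept, as in
# the listing.

```r
# 1.4826 makes the median absolute residual consistent for the normal
# standard deviation; the second factor is a finite-sample correction.
lms.scale <- function(resids, p) {
  n <- length(resids)
  1.4826 * (1 + 5 / (n - p - 1)) * median(abs(resids))
}

# One gross residual (30) barely moves the estimate
s <- lms.scale(c(-2, -1, 0, 1, 2, 30), p = 1)
```

# Because only the median of the absolute residuals enters, a single gross
# residual has no effect beyond shifting the median by one order statistic.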

# The function P2 is the initial estimate formed by clearing high leverage
# points with Rocke and Woodruff, followed by an MM estimate on the
# remaining observations
P2<-function(x,y,iterat=1500)
{
  {
    x<-as.matrix(x)
    p<-ncol(x)
    # Rocke and Woodruff procedure clears high leverage observations
    rwdist<-rocky(x,iterat)
    good<-ifelse(rwdist$distance<rwdist$reject,1,0)
    goodx<-x[good==1,]
    goodxint<-cbind(1,x[good==1,])
    goody<-y[good==1]
    # MM estimator from ROBETH
    mm<-myhbhe(goodxint,goody)
    # Internal Splus MM estimator (had trouble with data sets in Ch 4)
    # mm<-lmRobMM(goody~goodx,efficiency=.90)
    percentobs<-length(goody)/length(y)
    xint<-cbind(1,x)
    predval<-mm$theta1%*%t(xint)
    resids<-y-t(predval)
    medres<-median(abs(resids))
    # median of absolute residuals
    scale <- 1.4826 * (1 + (5/(length(y) - p - 1))) * medres
    # lms scale
    list(robdist=rwdist$distance,reject=rwdist$reject,coef=mm$coef,
      percentobs=percentobs,scale=scale,residuals=resids)
  }
}

# For the P3 initial estimator, substitute the function "sest(x,y)" for
# "myhbhe(x,y)" in P2 and "coef" instead of "theta1"
#
# This is the S estimate function using the ROBETH library.
sest<-function(x,y){
  {
    # need column of ones
    x<-as.matrix(x)
    x<-cbind(1,x)
    y<-as.matrix(y)
    np<-ncol(x)
    nppl<-np+1
    dfvals()
    dfrpar(x,'S')
    ribetu(y)
    zr<-hysest(x,y,nppl,iopt=1,intch=1,iseed=5431)
    coef<-zr$theta[1:np]
    smin<-zr$smin
    rs<-zr$rs
    nrep<-zr$nrep
    cov<-zr$cov
    ierr<-zr$ierr
    dfcomn(ipsi=4,xk=1.5477)
    S.w<-Psi(rs/smin)/(rs/smin)  #weights
    list(coef=coef,resid=rs,nrep=nrep,smin=smin,cov=cov,ierr=ierr,w=S.w)
  }
}

# The following is the code for the proposed compound estimator CEP1.  The
# initial estimate is P1 (OLS estimate after R&W and MM filter).  For CEP2
# just use P2 instead of P1 in the init argument.  LMS measure of scale
# and pi weights from the R&W robust distances are used in both estimators.
# The other components follow that of the Simpson and Montgomery estimator.
# Do not include column of 1s in for x matrix.
CEP1<-function(x, y, w = rep(1, nrow(x)), int = TRUE, init = P1(x,y),
  method = wt.bibisquare, wx, iter = 1,
  acc = 50 * .Machine$single.eps^0.5, test.vec = "resid")
{
  coefin <- coef <- init$coef
  x <- as.matrix(cbind(1, x))
  if(!missing(wx)) {
    if(length(wx) != nrow(x))
      stop("Length of wx must equal number of observations")
    if(any(wx < 0))
      stop("Negative wx value")
    w <- w * wx
  }
  if(ncol(x) != length(coef))
    stop("Must have same number of initial values as coefficients")
  resid <- init$residuals
  # Determine the tuning constant based on the suggestion of Marazzi and
  # Joss (1993)
  tc <- 4.685
  xwt <- as.matrix(x)
  #
  # Scale the distances such that the median distance is unity and all others
  # are a ratio of the R&W distance to the median R&W distance
  rockwood.dis<-init$robdis/median(init$robdis)
  pi<-1/rockwood.dis
  # LMS-estimator scale estimate
  scale <- init$scale
  # IRLS step
  for(iiter in 1:iter) {
    epis <- c(resid/(scale*pi))
    # In case the residuals go to zero, keeps the weight = 1 (vs undefined)
    if(any(resid == 0)) {
      for (i in 1:length(y)) {
        if (resid[i] == 0)
          w[i] <- 1
        else
          w[i] <- method(epis[i], tc)
      }
    }
    else
      # Tukey biweight
      w <- method(epis,tc)
    # if any weights are missing set them to 0.9999 and write them to a file
    # if(any(is.na(w)))
    # break
    #
    if(!missing(wx))
      w <- w * wx
    temp <- lsfit(x, y, w, int = FALSE)
    coef <- temp$coef
    resid <- temp$residuals
  }
  if(!missing(wx)) {
    tmp <- (wx != 0)
    w[tmp] <- w[tmp]/wx[tmp]
  }
  list(coef = coef, initialest = coefin, init.pct=init$percentobs, residuals =
    resid, scale = scale, tc = tc, distances = rockwood.dis,
    piweight = pi, erroverpis = epis, w = w, int = int)
}
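# A compact R sketch of one reweighting step of the kind CEP1 performs: scale
# each residual by the scale estimate times its pi weight, convert to a
# bisquare weight, and refit by weighted least squares.  This is an
# illustration under simplifying assumptions, not the thesis code: the names
# bisq.wt and irls.step are hypothetical, the weighted fit uses the normal
# equations rather than lsfit(), and the toy data use unit pi weights.

```r
# Bisquare weight written directly, so a zero residual gets weight 1
bisq.wt <- function(u, tc = 4.685) ifelse(abs(u) < tc, (1 - (u / tc)^2)^2, 0)

# One iteratively reweighted least squares step; X includes the column of 1s
irls.step <- function(X, y, coef, scale, piwt, tc = 4.685) {
  resid <- c(y - X %*% coef)
  w <- bisq.wt(resid / (scale * piwt), tc)   # downweight large scaled residuals
  XtW <- t(X * w)                            # X'W (row i of X times w[i])
  c(solve(XtW %*% X, XtW %*% y))             # weighted least squares solution
}

# Exact line y = 2 + 3x with one gross outlier in the last case
x <- 1:10
X <- cbind(1, x)
y <- 2 + 3 * x
y[10] <- y[10] + 100
new.coef <- irls.step(X, y, coef = c(2, 3), scale = 1, piwt = rep(1, 10))
```

# Starting from a good initial estimate, the outlier's scaled residual exceeds
# the tuning constant, its weight is zero, and the refit recovers (2, 3).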

# The proposed estimators CEP3 and CEP4 use the Simpson and Montgomery shell
# and integrate the R&W distances into the code.  CEP4 is the same code
# except iter = 4.
CEP3<-function(x, y, robdis=rocky(x, 1500), w = rep(1, nrow(x)), int = TRUE,
  init = sest(x,y), method = wt.bibisquare, wx, iter = 3, acc = 50 *
  .Machine$single.eps^0.5, test.vec = "resid")
{
  # rockdis<-rocky(x, 2000)
  rockdism<-robdis$distance/median(robdis$distance)
  if(int) {
    coef <- init$coef
    coefin <- coef
    x <- cbind(1, x)
  }
  else {
    init <- sest(x,y,int=FALSE)
    coefin <- coef <- init$coef
    x <- as.matrix(x)
  }
  if(!missing(wx)) {
    if(length(wx) != nrow(x))
      stop("Length of wx must equal number of observations")
    if(any(wx < 0))
      stop("Negative wx value")
    w <- w * wx
  }
  if(ncol(x) != length(coef))
    stop("Must have same number of initial values as coefficients")
  resid <- y - x %*% coef
  # Determine the tuning constant based on the suggestion of Marazzi and
  # Joss (1993)
  tc <- 4.685
  if (int==F) xwt <- as.matrix(cbind(1,x))
  else
    xwt <- as.matrix(x)
  # Robeth pi weights using the scatter matrix
  dfrpar(xwt, "Kra-Wel")
  # Weights
  z <- wimedv(xwt)
  z <- wynalg(xwt, z$a); nitw <- z$nit
  # Scale the distances such that the median distance is unity and all others
  # are a ratio of the actual distance to the median distance
  # If any of the design points are at the design center (z$dist=0)
  if(any(z$dist <= 1)){
    for (i in 1:length(y)) {
      if (z$dist[i] <= 1) z$dist[i] <- 1
    }
  }
  z$distm <- z$dist/median(z$dist)
  pi<-1/rockdism
  # pi <- 1/z$distm
  # S-estimator scale estimate
  scale <- init$smin
  for(iiter in 1:iter) {
    if(scale == 0) {
      convi <- 0
      method.exit <- TRUE
      status <- "could not compute scale of residuals"
    }
    else {
      epis <- c(resid/(scale*pi))
      # In case the residuals go to zero, keeps the weight = 1 (vs undefined)
      if(any(resid == 0)) {
        for (i in 1:length(y)) {
          if (resid[i] == 0)
            w[i] <- 1
          else
            w[i] <- method(epis[i],tc)
        }
      }
      else
        w <- method(epis,tc)
      # if any weights are missing set them to 0.9999 and write them to a file
      # if(any(is.na(w)))
      # break
      #
      if(!missing(wx))
        w <- w * wx
      temp <- lsfit(x, y, w, int = FALSE)
      coef <- temp$coef
      resid <- temp$residuals
    }
  }
  if(!missing(wx)) {
    tmp <- (wx != 0)
    w[tmp] <- w[tmp]/wx[tmp]
  }
  list(coef = coef, initialest = coefin, residuals = resid, scale = scale,
    tc = tc, distances = rockdism,
    piweight = pi, erroverpis = epis, w = w, int = int)
}

wt.bibisquare <-
# bounded influence WEIGHT FUNCTION where w(t) = psi(t) / t and
# t = e / pi*s   The Bisquare psi function
# user supplied tuning constant
function(u, tc=4.685)
{
  U <- abs(u/tc)
  si <- u*(1 - (u/tc)^2)^2
  si[U > 1] <- 0
  w <- si/u
  w
}
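# A brief R sketch of why the compound estimators treat zero residuals
# separately: computing the weight as psi(u)/u gives 0/0 at u = 0.  Writing
# the bisquare weight directly, as in the hypothetical function wt.direct
# below, returns 1 there without any special case.

```r
# Bisquare weight w(u) = (1 - (u/tc)^2)^2 for |u| <= tc, and 0 otherwise
wt.direct <- function(u, tc = 4.685) {
  w <- (1 - (u / tc)^2)^2
  w[abs(u) > tc] <- 0
  w
}

# Weights at zero, half the tuning constant, the tuning constant, and beyond
w <- wt.direct(c(0, 4.685 / 2, 4.685, 10))
```

# The two forms agree for every nonzero u, so the choice only matters when a
# residual is exactly zero.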


Monte Carlo simulation data for Figure 4.1 showing the area of coverage for the various compound
estimators. Each cell gives the proportion of times out of 50 replicates that the procedure assigned a
standardized residual value of 2.5 or greater.

          0.00  0.00  1.00  0.36  0.82
5    7    0.00  0.02  0.96  0.52  0.86
5    8    0.00  0.00  0.96  0.70  0.92
5    9    0.00  0.10  0.96  0.72  0.80
6    3    0.00  0.00  0.07  0.06  0.40
6    4    0.00  0.00  0.84  0.28  0.56
6    5    0.00  0.00  0.92  0.32  0.58
6    6    0.00  0.00  1.00  0.48  0.84
6    7    0.00  0.00  1.00  0.70  0.94
6    8    0.00  0.00  1.00  0.84  0.96

Appendix  C 


S-Plus  Code  for  Chapter  5  Studies 
# The function bijs5 is the abbreviated version of the Simpson and
# Montgomery (1998) estimator.  It provides only coefficient estimates,
# residuals and final weights for computational considerations.
bijs5<-function(x, y, w = rep(1, nrow(x)), int = TRUE, init = fastsest(x, y),
  method = wt.bibisquare, wx, iter = 1,
  acc = 50 * .Machine$single.eps^0.5, test.vec = "resid")
{
  {
    coef <- init$coef
    x <- cbind(1, x)  # w <- w * wx
    resid <- y - x %*% coef
    # Determine the tuning constant based on
    # the suggestion of Marazzi and Joss (1993)
    tc <- 4.685
    xwt <- as.matrix(x)
    # Robeth pi weights using the scatter matrix
    dfrpar(xwt, "Kra-Wel")
    # Weights
    z <- wimedv(xwt)
    z <- wynalg(xwt, z$a)
    nitw <- z$nit
    # Scale the distances such that the median distance
    # is unity and all others are a ratio of the
    # actual distance to the median distance
    z$distm <- z$dist/median(z$dist)
    pi <- 1/z$distm
    # S-estimator scale estimate
    scale <- init$smin
    epis <- c(resid/(scale * pi))
    w <- method(epis, tc)
    temp <- lsfit(x, y, w, int = FALSE)
    coef <- temp$coef
    resid <- temp$residuals
    if(!missing(wx)) {
      tmp <- (wx != 0)
      w[tmp] <- w[tmp]/wx[tmp]
    }
  }
  list(coef = coef, residuals = resid, weight=w)
}

# The function fastsest is the initial S estimator for the abbreviated
# version of the Simpson and Montgomery compound estimator.
fastsest<-function(x, y)
{
  {
    # need column of ones
    x <- as.matrix(x)
    x <- cbind(1, x)
    y <- as.matrix(y)
    np <- ncol(x)
    nppl <- np + 1
    dfvals()
    dfrpar(x, "S")
    ribetu(y)
    zr <- hysest(x, y, nppl, iopt = 1, intch = 1, iseed = 5431)
    coef <- zr$theta[1:np]
    smin <- zr$smin
    rs <- zr$rs
    nrep <- zr$nrep
    cov <- zr$cov
    ierr <- zr$ierr
    dfcomn(ipsi = 4, xk = 1.5477)
    S.w <- Psi(rs/smin)/(rs/smin)  #weights
  }
  list(coef = coef, smin = smin)
}

# The function gendatagm generates the Gunst and Mason (1980) data set
# used in the Shao (1993, 1996) studies.  The regressor levels are
# always the Gunst and Mason data while the responses are generated
# from the known beta vector + N(0,sigma) as in Shao.  The input parameters
# are seedy for the random number seed and sigma to control the signal-to-
# noise ratio.  Shao uses sigma = 1.0.
gendatagm<-function(seedy,sigma)
{
  {
    var2 <- c(0.36, 1.2, 0.06, 0.16, 0.01, 0.02, 0.56, 0.98, 0.32, 0.01,
      0.15, 0.24, 0.11, 0.08, 0.61, 0.03, 0.06, 0.02, 0.04, 0,
      0.09, 0.02, 0.02, 0.05, 0.11, 0.18, 0.04, 0.85, 0.17, 0.08, 0.38,
      0.11, 0.39, 0.43, 0.57, 0.13, 0.04, 0.13, 0.2, 0.07)
    var3 <- c(0.53, 2.52, 0.09, 0.41, 0.02, 0.07, 0.62, 1.06, 0.2, 0,
      0.25, 0.28, 0.35, 0.13, 0.85, 0.03, 0.11, 0.08, 0.24, 0.02,
      0.18, 0.16, 0.11, 0.24, 0.39, 0.11, 0.09, 1.33, 0.32, 0.12, 0.18,
      0.13, 0.38, 0.46, 1.16, 0.03, 0.05, 0.18, 0.95, 0.06)
    var4 <- c(1.06, 5.74, 0.27, 0.83, 0.07, 0.07, 2.12, 2.89, 0.76, 0.07,
      0.5, 0.59, 0.4, 0.28, 0.49, 0.23, 0.5, 0.25, 0.08, 0.04,
      0.59, 0.24, 0.21, 0.43, 0.29, 0.43, 0.23, 2.7, 0.66, 0.49, 0.49,
      0.18, 0.99, 1.47, 1.82, 0.08, 0.14, 0.28, 0.41, 0.18)
    var5 <- c(0.5326, 3.6183, 0.2594, 1.0346, 0.0381, 0.344, 1.4459,
      4.0182, 0.46, 0.154, 0.6516, 0.0611, 0.1922, 0.0931, 0.0538,
      0.0199, 0.0419, 0.1093, 0.0328, 0.0797, 0.1855, 0.1572, 0.0998,
      0.2804, 0.2879, 0.681, 0.3242, 2.6013, 0.4469, 0.2436,
      0.44, 0.3351, 1.3979, 2.0138, 1.9356, 0.105, 0.2207, 0.018,
      0.1017, 0.0962)
    # The following line augments the design matrix with 5 noise variables
    # if desired
    # noise<-matrix(rnorm(200),nrow=40)
    # x <- matrix(cbind(1, var4, var5, var2, var3, noise), ncol = 10)
    # Otherwise, the original data...
    x <- matrix(cbind(1, var4, var5, var2, var3), ncol = 5)
    # beta is the known generating vector.  Note the order of the variables
    # is changed to reflect Shao.
    beta <- matrix(c(2, 4, 8, 0, 0), ncol = 1)
    set.seed(seedy)
    error <- matrix(rnorm(40, 0, sigma), ncol = 1)
    y <- x %*% beta + error
    x <- x[, -1]
  }
  return(x, y)
}

# The function gendatagmc generates the modified Gunst and Mason data
# used in Chapter 5 with 10% outliers planted.  The example in Chapter
# 1 uses 129 as seedy.  The variable sigma determines how much noise
# is added and if it is >= 5, no estimator works.
gendatagmc<-function(seedy, sigma)
{
  {
    j <- gendatagm(seedy, sigma)
    beta <- matrix(c(2, 4, 8, 0, 0), ncol = 1)
    j$x <- cbind(1, j$x)
    # Plant outliers a distance of 10 sigma away
    j$y[8] <- j$x[8, ] %*% beta + 10*sigma
    j$y[15] <- j$x[15, ] %*% beta + 10*sigma
    j$y[28] <- j$x[28, ] %*% beta + 10*sigma
    j$y[39] <- j$x[39, ] %*% beta + 10*sigma
    y <- j$y
    x <- j$x[, -1]
  }
  return(x, y)
}

#  The  function  gendatanormc  generates  N(0,1)  regressors  and  calculates 
# the response by multiplying by beta and adding N(0, sigma) noise.

#  The  last  5  of  40  observations  are  outliers.  The  last  3  are  also  high 

#  leverage.  The  inputs  are  the  seed,  sigma  (amount  of  noise  added  to 

#  response),  dis  (residual  and  leverage  magnitude  for  the  outliers) 

#  out  (number  of  outliers) 

gendatanormc <- function(seedy, out, dis, sigma)
{
    first <- 40 - out + 1  # find the first observation to make an outlier
    set.seed(seedy)
    x <- matrix(rnorm(160), nrow = 40)
    beta <- matrix(c(2, 4, 8, 0, 0), ncol = 1)
    x <- cbind(1, x)
    x[37:40, 2:5] <- x[37:40, 2:5] + dis
    y <- x %*% beta + rnorm(40, 0, sigma)
    y[first:40] <- y[first:40] + dis
    x <- x[, -1]
    return(list(x = x, y = y))
}

#  auxiliary  function  matmin  finds  the  element  of  a  vector  that  is  the  minimum 

#  and  assigns  it  a  value  of  1  while  all  others  are  0. 
matmin <- function(x)
{
    minx <- min(x)
    minvec <- ifelse(x == minx, 1, 0)
    return(minvec)
}

#  auxiliary  function  matmax  finds  the  maximum  element  in  a  vector 
matmax <- function(x)
{
    maxx <- max(x)
    maxvec <- ifelse(x == maxx, 1, 0)
    return(maxvec)
}
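
As a quick illustration of the two indicator helpers above, the sketch below applies the same `ifelse` rule to a made-up vector of prediction errors. The values and the names `pe`, `min.flag` and `max.flag` are hypothetical and are not taken from the study. It also shows that tied values all receive a 1, so a tied minimum would be tallied for more than one model.

```r
# Hypothetical prediction errors for five candidate models (illustrative only)
pe <- c(4.2, 1.7, 1.3, 1.3, 2.9)

# Same rule as matmin: flag every element equal to the minimum
min.flag <- ifelse(pe == min(pe), 1, 0)

# Same rule as matmax: flag every element equal to the maximum
max.flag <- ifelse(pe == max(pe), 1, 0)

print(min.flag)  # both tied minima (positions 3 and 4) are flagged
print(max.flag)
```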

#  auxiliary  function  sstman  finds  the  total  sums  of  squares /n 
sstman <- function(isub, y)
{
    sst <- sum((y[isub] - mean(y[isub]))^2)/length(y[isub])
    return(sst)
}

#  auxiliary  function  costcol  finds  the  prediction  error  for  a  model. 

#  vector  y  is  the  original  data  and  vector  x  is  the  predicted. 
costcol <- function(x, y)
{
    x <- matrix(x, ncol = 1)
    y <- matrix(y, ncol = 1)
    j <- sum((y - x)^2)/length(y)
    return(j)
}
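
The two error measures defined above are on the same per-observation scale, which is what lets the intercept-only baseline be compared directly with the fitted models later on. The following minimal sketch works both quantities by hand on made-up numbers; `y`, `yhat`, `ape` and `sst` are hypothetical names used only for this illustration.

```r
# Made-up response and fitted values (hypothetical, for illustration only)
y    <- c(1, 2, 3, 4)
yhat <- c(1.5, 2, 2, 4)

# costcol-style average squared prediction error
ape <- sum((y - yhat)^2) / length(y)

# sstman-style baseline: average squared deviation from the mean,
# i.e. the prediction error of the intercept-only model
sst <- sum((y - mean(y))^2) / length(y)

print(c(ape = ape, sst = sst))
```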

#  The  function  regbootl  performs  regression  using  x[isub]  to  predict  y[isub] 

#  isub  is  a  vector  of  length  n, 

#  a  bootstrap  sample  from  the  sequence  of  integers 

#  1,  2,  3,  ...,  n 

# 

#  This  function  is  used  by  other  functions  when  computing 

#  bootstrap  estimates,  x  is  regressors  (without  intercept),  regfun  is 

#  the  regression  estimator  (lsfit,  bijs5),  m  tells  how  large  the  bootstrap

#  sample  should  be  (full  sample  use  m=0) .  The  matrix  of  coefficients 

#  from  the  B  bootstrap  samples  and  the  bootstrap  prediction  error  for  each 

#  bootstrap  sample  (for  the  bias  correction)  are  returned. 

#  Also  returns  the  weighted  estimate  of  prediction  error  if  use  a  robust 

#  estimator,  if  use  lsfit,  then  comment  out  the  last  wtderr<-  line.
regbootl <- function(isub, x, y, regfun, m)
{
    wtderr <- NULL
    nmm <- nrow(x) - m
    xmat <- matrix(x[isub, ], nrow = nmm, ncol = ncol(x))
    regboot <- regfun(xmat, y[isub])
    coefficients <- matrix(regboot$coef, ncol = 1)
    xmat <- cbind(1, xmat)
    # bspe finds the prediction error for this bootstrap sample using
    # the bootstrap response values. This is needed for the unbiased estimate.
    bspe <- sum((regboot$residuals)^2)/length(y[isub])
    # wtderr weights the prediction error with the compound estimator's final
    # weights.
    wtderr <- sum(((regboot$residuals)^2) * regboot$weight)/length(y[isub])
    list(coef = coefficients, booterr = bspe, wtderr = wtderr)
}


#  The  function  willbs  executes  the  bootstrap  and  returns  the  average 

#  prediction  error  if  m  !=  0  or  the  bias  corrected  prediction  error 

#  if  m  =  0.  Note  that  x  does  not  have  a  column  of  1s.
willbs <- function(x, y, data, regfun = bijs5, nboot = 100, m)
{
    x <- as.matrix(x)
    p <- ncol(x) + 1
    y <- matrix(y, ncol = 1)
    bvec <- apply(data, 1, regbootl, x, y, regfun, m)
    # bvec is the p+1 by nboot matrix. The first row
    # contains the bootstrap intercepts, the second row
    # contains the bootstrap values for first predictor, etc.
    bootpe <- NULL
    bootwtderr <- NULL
    coef <- matrix(0, ncol = nboot, nrow = p)
    # this piece of inefficient code extracts the coefs and bootstrap
    # resubstitution error for each bootstrap sample.
    for(i in 1:nboot) {
        coef[, i] <- bvec[[i]]$coef
        bootpe[i] <- bvec[[i]]$booterr
        bootwtderr[i] <- bvec[[i]]$wtderr
    }
    # The n by nboot matrix of predicted values using the bootstrap coefficients
    # contained in the matrix coef and the real x's
    pred <- cbind(1, x) %*% coef
    # avg prediction error vector of length nboot if use the
    # bootstrap predictions and the observed y's
    apevec <- apply(pred, 2, costcol, y)
    # contains the vector of average prediction error
    # If use m = 0 (full bootstrap sample size) then need the unbiased estimate
    # of prediction error. resub is the usual resubstitution error using only
    # original data.
    resub <- sum((y - cbind(1, x) %*% bijs5(x, y)$coef)^2)/length(y)
    apevec.unbias <- (apevec - bootpe) + resub
    apevec.wtd <- bootwtderr
    return(list(apevec = apevec, apevec.unbias = apevec.unbias,
        apevec.wtd = apevec.wtd))
}
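
The `data` argument consumed by willbs is a matrix of resampled row indices, one bootstrap sample per row, which `apply(data, 1, ...)` hands to the regression helper one row at a time as `isub`. The sketch below shows how such a matrix can be built for a bootstrap sample of size n - m. The sizes mirror the run5bs2 example call (m = 20, nboot = 25, seed 155) but are otherwise arbitrary, and the code is only an illustration of the indexing, not part of the simulation itself.

```r
set.seed(155)
n <- 40       # observations in the full sample
m <- 20       # observations removed, so each resample has n - m rows
nboot <- 25   # number of bootstrap samples
nmm <- n - m

# One row of indices per bootstrap sample, drawn with replacement from 1:n
data <- matrix(sample(n, size = nmm * nboot, replace = TRUE), nrow = nboot)

print(dim(data))   # nboot rows, n - m columns
print(data[1, ])   # the indices that make up the first bootstrap sample
```

Setting m = 0 gives the usual full-size bootstrap, which is the case where the bias-corrected estimate is needed.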

#  The  function  win. pure  takes  a  vector  of  prediction  errors  and  returns  the 

#  dimension  of  the  best  model.  It  is  not  necessarily  the  model  with  the 

#  lowest  prediction  error.  Input  the  constant  (const)  for  minimum  change 

#  in  prediction  error  required  before  going  to  the  next  higher  dimension. 
win.pure <- function(prederr, const = 0.025)
{
    # Find the critical value for minimum change in prediction error from the
    # model with one less dimension before a variable
    # can be added. This is similar to the CART impurity criteria for splitting
    # nodes, Breiman et al. (1984).
    min.purity <- const * prederr[1]  # Determine the change in impurities
    delta <- matrix(0, nrow = 1, ncol = 5)
    # change ncol in above line for 10 variable models
    # delta[1,1] is made very large to offset the vector by 1 to account for
    # the intercept.
    delta[1, 1] <- 100000
    delta[1, 2] <- prederr[1] - prederr[2]
    delta[1, 3] <- prederr[2] - prederr[3]
    delta[1, 4] <- prederr[3] - prederr[4]
    delta[1, 5] <- prederr[4] - prederr[5]
    # delta[1, 6] <- prederr[5] - prederr[6]
    # delta[1, 7] <- prederr[6] - prederr[7]
    # delta[1, 8] <- prederr[7] - prederr[8]
    # delta[1, 9] <- prederr[8] - prederr[9]
    # delta[1, 10] <- prederr[9] - prederr[10]
    # the pure vector determines if the variable should be added
    pure <- ifelse(delta < min.purity, 1, 0)
    # create an index to choose the j-parameter model
    j <- rep(1:5, 1)
    # the best model is the last time the change in impurity is > crit val
    winner <- max(j[pure == 0])
    # additional check if the change in prediction
    # error occurs for high dim models then make sure it is lower than
    # prediction error for the true 5 variable model. This really should
    # be done several times.
    if(prederr[winner] > prederr[5]) {
        pure[winner] <- 1
        winner <- max(j[pure == 0])
    }
    return(winner)
}
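
To make the minimum-change rule in win.pure concrete, the sketch below traces it on a made-up vector of prediction errors. The numbers are hypothetical and chosen so that the third entry (intercept plus two predictors) is the last one to bring a large improvement. The sketch uses `diff` and `which` for brevity instead of the explicit assignments in the function, so it is a compact restatement under those assumptions rather than the function itself.

```r
# Hypothetical average prediction errors for the models with
# 1, 2, 3, 4 and 5 parameters (intercept-only first); illustrative values
prederr <- c(100, 40, 10, 10.05, 10.1)
const <- 0.025

# Minimum improvement required before a larger model is accepted
min.purity <- const * prederr[1]

# Improvement gained by each added parameter; the first entry is a large
# placeholder so that position k lines up with the k-parameter model
delta <- c(100000, -diff(prederr))

# 1 = improvement too small, 0 = improvement large enough
pure <- ifelse(delta < min.purity, 1, 0)

# Largest model whose last added term still gave a big enough improvement
winner <- max(which(pure == 0))

# Follow-up check used in win.pure: step back if the chosen model
# predicts worse than the full 5-parameter model
if (prederr[winner] > prederr[5]) {
    pure[winner] <- 1
    winner <- max(which(pure == 0))
}
print(winner)
```

With these values the threshold is 2.5, the improvements are 60, 30, -0.05 and -0.05, and the follow-up check does not trigger, so the rule settles on the 3-parameter model even though the differences among the last three errors are tiny.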

# shaosimgmr executes the entire bootstrap simulation by inputting the number
# of replications (iter), the number of outliers in the sample (out), the
# magnitude of the outliers (dis), the noise in the sample, NID(0, sigma),
# the number of observations to remove from the full sample for the bootstrap
# sample (m), the number of bootstrap samples (nboot), and a seed.


shaosimgmr <- function(iter, out, dis, sigma, m, nboot, seedy)
{
    # initialize the values of the matrices that store the number of times
    # each model is selected
    cumpct.shao <- matrix(0, nrow = 1, ncol = 5)
    cumpct.jw <- matrix(0, nrow = 1, ncol = 5)
    cumwin.shao <- matrix(0, nrow = 1, ncol = 5)
    cumwin.jw <- matrix(0, nrow = 1, ncol = 5)
    cumpct.shaowt <- matrix(0, nrow = 1, ncol = 5)
    cumpct.jwwt <- matrix(0, nrow = 1, ncol = 5)
    cumwin.shaowt <- matrix(0, nrow = 1, ncol = 5)
    cumwin.jwwt <- matrix(0, nrow = 1, ncol = 5)

    # Replications
    for(i in 1:iter) {
        cat("iter ", i, " ")
        seeder <- seedy + i
        j <- gendatanormc(seeder, out, dis, sigma)
        # data is the bootstrap resample matrix for all nboot samples.
        p <- ncol(j$x) + 1
        nmm <- length(j$y) - m
        data <- matrix(sample(length(j$y), size = nmm * nboot,
            replace = T), nrow = nboot)
        # wbs1 calculates the nboot prediction errors for a 1 variable model
        wbs1 <- willbs(j$x[, 1], j$y, data, nboot = nboot, m = m)
        wbs2 <- willbs(j$x[, 1:2], j$y, data, nboot = nboot, m = m)
        wbs3 <- willbs(j$x[, 1:3], j$y, data, nboot = nboot, m = m)
        wbs4 <- willbs(j$x[, 1:4], j$y, data, nboot = nboot, m = m)
        # wbs5 <- willbs(j$x[, 1:5], j$y, data, nboot = nboot, m = m)
        # wbs6 <- willbs(j$x[, 1:6], j$y, data, nboot = nboot, m = m)
        # wbs7 <- willbs(j$x[, 1:7], j$y, data, nboot = nboot, m = m)
        # wbs8 <- willbs(j$x[, 1:8], j$y, data, nboot = nboot, m = m)
        # wbs9 <- willbs(j$x[, 1:9], j$y, data, nboot = nboot, m = m)
        # Now that we have the contending models avg pred error for all nboot fits
        # put them in a 5 by nboot matrix to find out the lowest prediction error of
        # the 5 contenders in each of the nboot trials. We first find the model with
        # no predictors as the total sum of squares.
        SST <- matrix(apply(data, 1, sstman, j$y), nrow = 1)
        # Use the unbiased prediction error if m = 0 (bootstrap sample size = n)
        # if(m != 0) {
        # For the outlier study (tab 5.13), we do not want to use the bias
        # correction for the full sample so we'll bypass it since m will never = 3.
        if(m != 3) {
            j <- matrix(rbind(SST, wbs1$apevec, wbs2$apevec,
                wbs3$apevec, wbs4$apevec), nrow = 5)
            jwt <- matrix(rbind(SST, wbs1$apevec.wtd,
                wbs2$apevec.wtd, wbs3$apevec.wtd, wbs4$apevec.wtd), nrow = 5)
        }
        else j <- matrix(rbind(SST, wbs1$apevec.unbias, wbs2$apevec.unbias,
            wbs3$apevec.unbias, wbs4$apevec.unbias, wbs5$apevec.unbias,
            wbs6$apevec.unbias, wbs7$apevec.unbias, wbs8$apevec.unbias,
            wbs9$apevec.unbias), nrow = 10)
        wbs1 <- NULL
        wbs2 <- NULL
        wbs3 <- NULL
        wbs4 <- NULL
        wbs5 <- NULL
        wbs6 <- NULL
        wbs7 <- NULL
        wbs8 <- NULL
        wbs9 <- NULL

        # pct.shao finds the selection percentage of the nboot samples using
        # the minimum prediction error criteria.
        pct.shao <- apply(apply(j, 2, matmin), 1, sum)
        pct.shaowt <- apply(apply(jwt, 2, matmin), 1, sum)
        # cumpct.shao tallies this percentage over the iter iterations
        cumpct.shao <- cumpct.shao + pct.shao
        cumpct.shaowt <- cumpct.shaowt + pct.shaowt
        # win.shao selects the model with the lowest average prediction error
        # across the nboot samples.
        win.shao <- matmin(apply(j, 1, mean))
        win.shaowt <- matmin(apply(jwt, 1, mean))
        # cumwin.shao tallies the winners up across the iter iterations
        cumwin.shao <- cumwin.shao + win.shao
        cumwin.shaowt <- cumwin.shaowt + win.shaowt
        # pct.jw is the percentage of times the model is selected in the
        # nboot samples in purity metric rather than absolute minimum aggregate
        # prediction error.
        jw <- apply(j, 2, win.pure, const = 0.005)
        jwwt <- apply(jwt, 2, win.pure, const = 0.001)
        pct.jw <- c(sum(ifelse(jw == 1, 1, 0)), sum(ifelse(jw == 2, 1, 0)),
            sum(ifelse(jw == 3, 1, 0)), sum(ifelse(jw == 4, 1, 0)),
            sum(ifelse(jw == 5, 1, 0)))
        pct.jwwt <- c(sum(ifelse(jwwt == 1, 1, 0)), sum(ifelse(jwwt == 2, 1, 0)),
            sum(ifelse(jwwt == 3, 1, 0)), sum(ifelse(jwwt == 4, 1, 0)),
            sum(ifelse(jwwt == 5, 1, 0)))
        cumpct.jw <- cumpct.jw + pct.jw
        cumpct.jwwt <- cumpct.jwwt + pct.jwwt
        # win.jw finds the best model based on the average of prediction error
        # from the nboot samples with the minimum change in prediction error metric.
        win.jw <- win.pure(apply(j, 1, mean), const = 0.005)
        idx <- rep(0, 5)
        idx[win.jw] <- 1
        cumwin.jw <- cumwin.jw + idx  # clear the arrays
        win.jwwt <- win.pure(apply(jwt, 1, mean), const = 0.001)
        idx <- rep(0, 5)
        idx[win.jwwt] <- 1
        cumwin.jwwt <- cumwin.jwwt + idx
        j <- NULL
        jwt <- NULL
    }
    return(list(cumpct.shao = cumpct.shao, cumwin.shao = cumwin.shao,
        cumpct.jw = cumpct.jw, cumwin.jw = cumwin.jw,
        cumpct.shaowt = cumpct.shaowt, cumwin.shaowt = cumwin.shaowt,
        cumpct.jwwt = cumpct.jwwt, cumwin.jwwt = cumwin.jwwt, j = j, jwt = jwt))
}

#  The  function  cvpress  calculates  the  leave-one-out  estimate  of 

#  prediction  error  by  performing  n  regressions.  It  also  provides  the 

#  weighted  avg  prediction  error. 
cvpress <- function(x, y, method = bijs5)
{
    set.seed(129)
    x <- as.matrix(x)
    n <- nrow(x)
    y <- as.matrix(y, ncol = 1)
    xint <- matrix(cbind(1, x), nrow = n)
    cvpred <- matrix(0, nrow = n, ncol = 1)
    prederr <- matrix(0, nrow = n, ncol = 1)
    # loop through all n observations and leave one out each time
    for(i in 1:n) {
        cvreg <- method(x[-i, ], y[-i])
        predvals <- xint %*% cvreg$coef
        cvpred[i] <- predvals[i]
    }
    prederr <- ((y - cvpred)^2)
    CV <- mean(prederr)
    # create diagonal matrix of weights from robust estimator
    reg <- method(x, y)
    wt <- diag(reg$weight, ncol = n)
    # find weighted avg prediction error
    CVwt <- mean((wt %*% prederr))
    return(list(CV = CV, CVwt = CVwt))
}
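
The leave-one-out loop in cvpress can be checked on a small case with ordinary least squares, where the same quantity is also available from a single fit through the leverages (the PRESS identity). The sketch below is only an illustration under stated assumptions: it substitutes `lsfit` for the robust estimator, uses simulated data, and the names `cvpred`, `full`, `h` and `press` are hypothetical. The shortcut holds for least squares only; a robust estimator still needs the n separate fits.

```r
set.seed(129)
n <- 20
x <- matrix(rnorm(n * 2), nrow = n)          # simulated regressors (hypothetical)
y <- 2 + 4 * x[, 1] + 8 * x[, 2] + rnorm(n)  # simulated response

xint <- cbind(1, x)        # design matrix with intercept column
cvpred <- numeric(n)
for (i in 1:n) {
    fit <- lsfit(x[-i, , drop = FALSE], y[-i])   # fit without observation i
    cvpred[i] <- xint[i, ] %*% fit$coefficients  # predict the held-out point
}
CV <- mean((y - cvpred)^2)   # leave-one-out estimate of prediction error

# For least squares the same quantity follows from one fit via the hat matrix
full <- lsfit(x, y)
h <- hat(x)                  # leverages; hat() adds the intercept by default
press <- mean((full$residuals / (1 - h))^2)
print(c(CV = CV, PRESS.over.n = press))
```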

# The function cvlsim is the full simulation for the leave-one-out estimate
# of prediction error. Input the number of replicates (iter), the number
# of outliers (out), the magnitude of the outliers (dis), the noise to
# generate the response N(0, sigma) and the seed so results can be duplicated
# and the same datasets are used except factors altered. If other data sets
# are used like Gunst and Mason, you don't need all those parameters.
# estimator is the regression estimator that must have at least $coef and
# $weight for weighted avg prediction error. Note that dis is delta*sigma in
# table 5.14, so 5delta and 5sigma means dis=25 for simulation.
cvlsim <- function(iter, out, dis, sigma, seedy, estimator)
{
    cumwin.shao <- matrix(0, nrow = 1, ncol = 5)
    cumwin.jw <- matrix(0, nrow = 1, ncol = 5)
    cumwin.shaowt <- matrix(0, nrow = 1, ncol = 5)
    cumwin.jwwt <- matrix(0, nrow = 1, ncol = 5)
    for(i in 1:iter) {
        cat("iter ", i, " ")
        seeder <- seedy + i
        j <- gendatanormc(seeder, out, dis, sigma)
        # The following code removes the outliers if the standardized residuals
        # are larger than 2.5 for a fit with Simpson and Montgomery estimator.
        # smreg <- bijs5sa(j$x, j$y)
        # absres <- abs(smreg$residuals/smreg$scale)
        # j$x <- j$x[absres < 2.5, ]
        # j$y <- j$y[absres < 2.5]
        # Find SST
        cv0 <- sum((j$y - mean(j$y))^2)/length(j$y)
        # Get cross-validation estimates of prediction error for 1 var model
        cv1 <- cvpress(j$x[, 1], j$y, method = estimator)
        cv2 <- cvpress(j$x[, 1:2], j$y, method = estimator)
        cv3 <- cvpress(j$x[, 1:3], j$y, method = estimator)
        cv4 <- cvpress(j$x[, 1:4], j$y, method = estimator)
        # cv5 <- cvpress(j$x[, 1:5], j$y, method = estimator)
        # cv6 <- cvpress(j$x[, 1:6], j$y, method = estimator)
        # cv7 <- cvpress(j$x[, 1:7], j$y, method = estimator)
        # cv8 <- cvpress(j$x[, 1:8], j$y, method = estimator)
        # cv9 <- cvpress(j$x[, 1:9], j$y, method = estimator)
        # Now we have the 5 contending models with 2 measures of cross validation
        # prediction error for each alternative.
        cv <- matrix(c(cv0, cv1$CV, cv2$CV, cv3$CV, cv4$CV), nrow = 1)
        cvwt <- matrix(c(cv0, cv1$CVwt, cv2$CVwt, cv3$CVwt,
            cv4$CVwt), nrow = 1)
        # The matrix win.shao has a 1 entry for minimum prediction error
        # otherwise it is 0.
        win.shao <- ifelse(cv == min(cv), 1, 0)
        cumwin.shao <- cumwin.shao + win.shao
        win.shaowt <- ifelse(cvwt == min(cvwt), 1, 0)
        cumwin.shaowt <- cumwin.shaowt + win.shaowt
        # win.jw finds the model that meets the change in prediction error criteria.
        win.jw <- win.pure(cv, const = 0.0025)
        win.jwwt <- win.pure(cvwt, const = 0.0005)
        idx <- rep(0, 5)
        idx[win.jw] <- 1
        idxwt <- rep(0, 5)
        idxwt[win.jwwt] <- 1
        cumwin.jw <- cumwin.jw + idx
        cumwin.jwwt <- cumwin.jwwt + idxwt
    }
    return(list(cv = cv, cumwin.shao = cumwin.shao, cumwin.jw = cumwin.jw,
        cvwt = cvwt, cumwin.shaowt = cumwin.shaowt, cumwin.jwwt = cumwin.jwwt))
}


#  The  function  crossvald  computes  the  K-fold  and  adjusted  K-fold  estimates 

#  of  prediction  error.  It  also  returns  the  weighted  estimates  if  a  robust 

#  estimator  is  used. 

crossvald <- function(x, y, method = bijs5, cvmse = function(y, yhat)
    mean((y - yhat)^2), K = 6)
{
    set.seed(129)
    x <- as.matrix(x)
    n <- nrow(x)
    out <- NULL
    f <- ceiling(n/K)
    # Sample without replacement from a vector [1, 2, 3, ..K, 1, 2, ..K...]
    # (there are f repetitions of 1, 2, ...K) to identify which assessment group
    # the observation belongs. The sample size is n. The assessment sample
    # sizes should be close to one another because we resample without
    # replacement.
    s <- sample(rep(1:K, f), n)
    y <- as.matrix(y, ncol = 1)
    regress <- method(x, y)  # find predicted values for the model
    predvals <- y - regress$residuals
    # Overall resubstitution error is corr. This is the initial value
    # required to compute the bias correction factor.
    corr <- cvmse(y, predvals)
    CV <- 0
    CVwt <- NULL
    xint <- matrix(cbind(1, x), nrow = n)
    # For each assessment set S.as compute predicted values
    pe <- matrix(0, ncol = 1, nrow = n)
    for(i in 1:K) {
        # Select observations with index i for the assessment set
        S.as <- c(1:n)[(s == i)]
        # The training set is all the observations to remain
        S.tr <- c(1:n)[(s != i)]
        # Perform regression with the current training set
        cvreg <- method(x[S.tr, ], y[S.tr])
        predvals <- xint %*% cvreg$coef
        # The proportion of the data in the ith assessment set
        p.alpha <- length(S.as)/n
        pe[S.as] <- (y[S.as] - predvals[S.as])^2
        pred.err <- cvmse(y[S.as], predvals[S.as])
        CV <- CV + p.alpha * pred.err
        corr <- corr - p.alpha * cvmse(y, predvals)
        CV.C <- CV + corr
    }
    # calculate the weighted avg prediction error for uncorrected bootstrap
    wt <- diag(regress$weight, ncol = n)
    CVwt <- mean(wt %*% pe)
    return(list(CV = CV, corr = corr, CV.C = CV.C, coef = cvreg$coef,
        CVwt = CVwt))
}
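
The fold-assignment step in crossvald is easy to misread, so here is a minimal sketch of it on its own. It assumes n = 40 and K = 6, as in the simulations. The name `fold.sizes` is hypothetical. Because only n of the K * f available labels are drawn, the folds are nearly, but not exactly, equal in size.

```r
set.seed(129)
n <- 40
K <- 6
f <- ceiling(n / K)              # 7 repetitions of 1:K give 42 labels
s <- sample(rep(1:K, f), n)      # draw n labels without replacement

# Number of observations assigned to each assessment group
fold.sizes <- table(factor(s, levels = 1:K))
print(fold.sizes)
```

With 42 labels and 40 draws, two labels are left out, so every assessment set holds between 5 and 7 observations.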

#  cvksim  performs  the  simulations  for  K-fold  cross  validation.  It  is  set  up 

#  to  output  the  K-fold  prediction  error  and  the  weighted  K-fold.  It 

#  must  be  modified  to  do  least  squares  by  commenting  out  the  2  lines  in 

#  crossvald  for  wt  and  CVwt  and  change  CVwt  to  CV.C  in  cvksim  as  directed. 

# It is set up for a 5 variable model, change ncol= to number of variables
# desired and add the quantities to assignments in "cv".
cvksim <- function(iter, out, dis, sigma, seedy, estimator)
{
    cumwin.shao <- matrix(0, nrow = 2, ncol = 5)
    cumwin.jw <- matrix(0, nrow = 2, ncol = 5)
    for(i in 1:iter) {
        cat("iter ", i, " ")
        seeder <- seedy + i
        set.seed(seeder)
        j <- gendatanormc(seeder, out, dis, sigma)
        # code to remove outliers first with Simpson and Montgomery estimator
        # smreg <- bijs5sa(j$x, j$y)
        # absres <- abs(smreg$residuals/smreg$scale)
        # j$x <- j$x[absres < 2.5, ]
        # j$y <- j$y[absres < 2.5]
        nn <- length(j$y)
        cv0 <- sum((j$y - mean(j$y))^2)/length(j$y)
        # Get cross-validation estimates of prediction error for 1 var model
        cv1 <- crossvald(j$x[, 1], j$y, method = estimator, K = 6)
        # 2 variable model
        cv2 <- crossvald(j$x[, 1:2], j$y, method = estimator, K = 6)
        cv3 <- crossvald(j$x[, 1:3], j$y, method = estimator, K = 6)
        cv4 <- crossvald(j$x[, 1:4], j$y, method = estimator, K = 6)
        # cv5 <- crossvald(j$x[, 1:5], j$y, method = estimator, K = 6)
        # cv6 <- crossvald(j$x[, 1:6], j$y, method = estimator, K = 6)
        # cv7 <- crossvald(j$x[, 1:7], j$y, method = estimator, K = 6)
        # cv8 <- crossvald(j$x[, 1:8], j$y, method = estimator, K = 6)
        # cv9 <- crossvald(j$x[, 1:9], j$y, method = estimator, K = 6)
        # Now we have the 5 contending models with 4 measures of cross val error
        # for each one. Put them in a 4 by 5 matrix to find out the best model
        # under the criterion. Note that this run is set up to evaluate the K-fold
        # and the weighted K-fold. Simply replace cv*$CVwt with cv*$CV.C to get
        # the bias corrected versions. To evaluate more than 5 variables remove
        # the comments from cv# above and add those variables to "cv" matrix.
        cv <- matrix(rbind(c(cv0, cv1$CV, cv2$CV, cv3$CV, cv4$CV),
            c(cv0, cv1$CVwt, cv2$CVwt, cv3$CVwt, cv4$CVwt)), nrow = 2)
        # The matrix winners has a 1 entry if the prediction error is lowest
        # 0 otherwise.
        win.shao <- t(apply(cv, 1, matmin))
        cumwin.shao <- cumwin.shao + win.shao
        winidx.jw <- apply(cv, 1, win.pure, const = 0.0025)
        cv.jw <- rep(0, 5)
        cv.jw[winidx.jw[1]] <- 1
        cvadj.jw <- rep(0, 5)
        cvadj.jw[winidx.jw[2]] <- 1
        win.jw <- matrix(rbind(cv.jw, cvadj.jw), nrow = 2)
        cumwin.jw <- cumwin.jw + win.jw
    }
    return(list(cumwin.shao = cumwin.shao, cumwin.jw = cumwin.jw, cv = cv))
}

# The following are examples implementing the code
date()
run3cvl <- cvlsim(50, 4, 10, 1, 130, bijs5)
run3cvl
run5bs2 <- shaosimgmr(25, 8, 25, 5, 20, 25, 155)
run5bs2
run8cvk0025 <- cvksim(50, 8, 50, 5, 130, bijs5)
run8cvk0025
date()


